Abstract

The interactions between three or more random variables are often nontrivial and poorly understood,
yet they are paramount for future advances in fields such as network information theory, neuroscience,
genetics and many others. In this work, we propose to analyze these interactions as different modes of
information sharing. Towards this end, we introduce a novel axiomatic framework for decomposing the 
joint entropy, which characterizes the various ways in which random variables can share information. The 
key contribution of our framework is to distinguish between interdependencies where the information 
is shared redundantly, and synergistic interdependencies where the sharing structure exists in the whole 
but not between the parts. We show that our axioms determine unique formulas for all the terms of the 
proposed decomposition for a number of cases of interest. Moreover, we show how these results can 
be applied to several network information theory problems, providing a more intuitive understanding of 
their fundamental limits.

I. Introduction

Interdependence is a key concept for understanding the rich structures that can be exhibited by
biological, economical and social systems [1], [2]. Although this phenomenon lies at the heart
of our modern interconnected world, there is still no solid quantitative framework for analyzing
complex interdependences, even though such a framework is crucial for future advances in a number of disciplines. In
neuroscience, researchers desire to identify how various neurons affect an organism's overall
behavior, asking to what extent the different neurons are providing redundant or synergistic signals [3].
In genetics, the interactions and roles of multiple genes with respect to phenotypic phenomena
are studied, e.g. by comparing results from single and double knockout experiments [4].
In graph and network theory, researchers are looking for measures of the information encoded 
in node interactions in order to quantify the complexity of the network [5]. In communication 
theory, sensor networks usually generate strongly correlated data [6]; a haphazard design might
not account for these interdependencies and, undesirably, will process and transmit redundant
information across the network, degrading the efficiency of the system.

The dependencies that can exist between two variables have been extensively studied, generating
a variety of techniques that range from statistical inference [7] to information theory [8].
Most of these approaches require that one differentiate the role of the variables, e.g. between a 
target and predictor. However, the extension of these approaches to three or more variables is not 
straightforward, as a binary splitting is, in general, not enough to characterize the rich interplay 
that can exist between variables. Moreover, the development of more adequate frameworks has 
been difficult as most of our theoretical tools are rooted in sequential reasoning, which is adept 
at representing linear flows of influences but not as well-suited for describing distributed systems 
or complex interdependencies [9]. 

In this work, we propose to understand interdependencies between variables as information
sharing. In the case of two variables, the portion of the variability that can be predicted corresponds
to information that target and predictor have in common. Following this intuition, we
present a framework that decomposes the total information of a distribution according to how it is
shared among its variables. Our framework is novel in combining the hierarchical decomposition 
of higher-order interactions, as developed in [10], with the notion of synergistic information, as 
proposed in [11]. In contrast to [10], we study the information that exists in the system itself 
without comparing it with other related distributions. In contrast to [11], we analyze the joint 
entropy instead of the mutual information, looking for symmetric properties of the system. 

One important contribution of this paper is to distinguish shared information from predictability.
Predictability is a concept that requires a bipartite system divided into predictors
and targets. As different splittings of the same system often yield different conclusions, we see 
predictability as a directed notion that strongly depends on one’s “point of view”. In contrast, we 
see shared information as a property of the system itself, which does not require differentiated 
roles between its components. Although it is not possible in general to find a unique measure
of predictability, we show that the shared information can be uniquely defined for a number of 
interesting scenarios. 

Additionally, our framework provides new insight into various problems of network information
theory. Interestingly, many of the problems of network information theory that have been solved 
are related to systems which present a simple structure in terms of shared information and 
synergies, while most of the open problems possess a more complex mixture of them. 

The rest of this article is structured as follows. First, Section II introduces the notions of 
hierarchical decomposition of dependencies and synergistic information, reviewing the state of
the art and providing the necessary background for the unfamiliar reader. Section III presents
our axiomatic decomposition for the joint entropy, focusing on the fundamental case of three 
random variables. Then, we illustrate the application of our framework for various cases of 
interest: pairwise independent variables in Section IV, pairwise maximum entropy distributions 
and Markov chains in Section V, and multivariate Gaussians in Section VI. After that, Section VII
presents a first application of this framework in settings of fundamental importance for network 
information theory. Finally, Section VIII summarizes our main conclusions. 

II. Preliminaries and state of the art 

One way of analyzing the interactions between the random variables $\mathbf{X} = (X_1,\dots,X_N)$
is to study the properties of the correlation matrix $\mathcal{R}_X = \mathbb{E}\{\mathbf{X}\mathbf{X}^{*}\}$. However, this approach
only captures linear relationships and hence the picture provided by $\mathcal{R}_X$ is incomplete. Another
possibility is to study the matrix $\mathcal{I}_X = [I(X_i;X_j)]_{i,j}$ of mutual information terms. This matrix
captures the existence of both linear and nonlinear dependencies [12], but its scope is restricted
to pairwise relationships and thus misses all higher-order structure. To see an example of how
this can happen, consider two independent fair coins $X_1$ and $X_2$ and let $X_3 := X_1 \oplus X_2$ be the
output of an XOR logic gate. The mutual information matrix $\mathcal{I}_X$ has all its off-diagonal elements
equal to zero, making it indistinguishable from an alternative situation where $X_3$ is just another
independent fair coin.
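The XOR example can be checked numerically. The following is a minimal sketch, assuming a joint distribution is stored as a dictionary from outcome tuples to probabilities; the helper names `entropy` and `mi_matrix` are illustrative and not part of the original text.

```python
from itertools import product
from math import log2


def entropy(p, idx):
    """Shannon entropy in bits of the marginal of p over the positions in idx."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(w * log2(w) for w in marg.values() if w > 0)


def mi_matrix(p, n):
    """Matrix [I(X_i; X_j)]; the diagonal entry I(X_i; X_i) equals H(X_i)."""
    return [[entropy(p, [i]) + entropy(p, [j]) - entropy(p, sorted({i, j}))
             for j in range(n)] for i in range(n)]


# X3 = X1 XOR X2, with X1 and X2 independent fair coins.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
# Three independent fair coins.
indep = {x: 0.125 for x in product((0, 1), repeat=3)}

m_xor, m_indep = mi_matrix(xor, 3), mi_matrix(indep, 3)
same_matrix = all(abs(m_xor[i][j] - m_indep[i][j]) < 1e-12
                  for i in range(3) for j in range(3))

print(same_matrix)                # whether the pairwise picture is identical
print(entropy(xor, [0, 1, 2]))    # joint entropy of the XOR system
print(entropy(indep, [0, 1, 2]))  # joint entropy of three independent coins
```

The two systems share the same mutual information matrix but differ in joint entropy, which is the higher-order structure that the pairwise description misses.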

For the case of $\mathcal{R}_X$, a possible next step would be to consider higher-order moment matrices,
such as co-skewness and co-kurtosis. We seek their information-theoretic analogs, which complement
the description provided by $\mathcal{I}_X$. One method of doing this is by studying the information
contained in marginal distributions of increasingly larger sizes; this approach is presented in
Section II-A. Other methods try to provide a direct representation of the information that is
shared between the random variables; they are discussed in Sections II-B, II-C and II-D.

A. Negentropy and total correlation 

When the random variables that compose a system are independent, their joint distribution is
given by the product of their marginal distributions. In this case, the marginals contain all that is
to be learned about the statistics of the entire system. For an arbitrary joint probability density
function (p.d.f.), knowing the single-variable marginal distributions is not enough to capture all
there is to know about the statistics of the system.

To quantify this idea, let us consider $N$ discrete random variables $\mathbf{X} = (X_1,\dots,X_N)$ with
joint p.d.f. $p_{\mathbf{X}}$, where each $X_j$ takes values in a finite set with cardinality $\Omega_j$. The maximal
amount of information that could be stored in any such system is $\sum_j \log \Omega_j$, which
corresponds to the entropy of the p.d.f. $p_U := \prod_j p^{U}_{X_j}$, where $p^{U}_{X_j}(x) = 1/\Omega_j$ is the uniform
distribution for each random variable $X_j$. On the other hand, the joint entropy $H(\mathbf{X})$ with respect
to the true distribution $p_{\mathbf{X}}$ measures the actual uncertainty that the system possesses. Therefore,
the difference

$$\mathcal{N}(\mathbf{X}) := \sum_j \log \Omega_j - H(\mathbf{X}) \qquad (1)$$

corresponds to the decrease of the uncertainty about the system that occurs when one learns its
p.d.f., i.e. the information about the system that is contained in its statistics. This quantity is
known as negentropy [13], and can also be computed as

$$\mathcal{N}(X_1,\dots,X_N) = \sum_j \left[\log \Omega_j - H(X_j)\right] + \Big[\sum_j H(X_j) - H(\mathbf{X})\Big] \qquad (2)$$

$$= D\Big(\prod_j p_{X_j} \,\Big\|\, p_U\Big) + D\Big(p_{\mathbf{X}} \,\Big\|\, \prod_j p_{X_j}\Big), \qquad (3)$$

where $p_{X_j}$ is the marginal of the variable $X_j$ and $D(\cdot\|\cdot)$ is the Kullback-Leibler divergence. In
this way, (3) decomposes the negentropy into a term that corresponds to the information given
by simple marginals and a term that involves higher-order marginals. The second term is known
as the total correlation (TC) [14] (also known as multi-information [15]), which is equal to the
mutual information for the case of $N = 2$. Because of this, the TC has been suggested as an
extension of the notion of mutual information for multiple variables.
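The split of the negentropy into a single-marginal term and the total correlation can be illustrated with a small numerical sketch. The distribution below is a hypothetical example chosen for illustration (a biased bit, a copy of it, and an independent fair bit), and the function names are assumptions rather than notation from the text.

```python
from math import log2


def entropy(p, idx):
    """Shannon entropy in bits of the marginal of p over the positions in idx."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(w * log2(w) for w in marg.values() if w > 0)


def negentropy_terms(p, alphabet_sizes):
    """Return (single-marginal term, total correlation) of the split in (2)."""
    n = len(alphabet_sizes)
    h_marginals = sum(entropy(p, [j]) for j in range(n))
    marginal_term = sum(log2(size) for size in alphabet_sizes) - h_marginals
    total_correlation = h_marginals - entropy(p, list(range(n)))
    return marginal_term, total_correlation


# Hypothetical system: X1 is a biased bit, X2 copies X1, X3 is an independent fair bit.
p = {}
for x1, w1 in ((0, 0.75), (1, 0.25)):
    for x3 in (0, 1):
        p[(x1, x1, x3)] = w1 * 0.5

marginal_term, tc = negentropy_terms(p, [2, 2, 2])
negentropy = 3 * log2(2) - entropy(p, [0, 1, 2])  # definition (1)

print(marginal_term, tc, negentropy)
```

In this example the total correlation equals $H(X_1)$, since the only dependency is the copy between $X_1$ and $X_2$, while the single-marginal term reflects the bias of the two non-uniform bits.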

An elegant framework for decomposing the TC can be found in [10] (for an equivalent
formulation that does not rely on information geometry cf. [16]). Let us call $k$-marginals the
distributions that are obtained by marginalizing the joint p.d.f. over $N - k$ variables. Note that
the $k$-marginals provide a more detailed description of the system than the $(k-1)$-marginals, as
the latter can be directly computed from the former by marginalizing the corresponding variables.
In the case where only the 1-marginals are known, the simplest guess for the joint distribution
is $\tilde{p}^{(1)}_{\mathbf{X}} = \prod_j p_{X_j}$. One way of generalizing this for the case where the $k$-marginals are known
is by using the maximum entropy principle [17], which suggests choosing the distribution that
maximizes the joint entropy while satisfying the constraints given by the partial ($k$-marginal)
knowledge. Let us denote by $\tilde{p}^{(k)}_{\mathbf{X}}$ the p.d.f. which achieves the maximum entropy while being
consistent with all the $k$-marginals, and let $H^{(k)}$ denote its entropy. Note that
$H^{(k)} \geq H^{(k+1)}$, since the number of constraints that are involved in the maximization process
that generates $H^{(k)}$ increases with $k$. It can therefore be shown that the following generalized
Pythagorean relationship holds for the total correlation:

$$\mathrm{TC} = H^{(1)} - H(\mathbf{X}) = \sum_{k=2}^{N} \left[H^{(k-1)} - H^{(k)}\right] := \sum_{k=2}^{N} \Delta H^{(k)}. \qquad (4)$$

Above, $\Delta H^{(k)} \geq 0$ measures the additional information that is provided by the $k$-marginals that
was not contained in the description of the system given by the $(k-1)$-marginals. In general,
the information that is located in terms with higher values of $k$ is due to dependencies between
groups of variables that cannot be reduced to combinations of dependencies between smaller
groups.
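The layers $\Delta H^{(k)}$ can be approximated numerically for small systems. The sketch below is one possible approach, not the method of [10]: it assumes that iterative proportional fitting, started from the uniform distribution, is used to approximate the maximum entropy distribution consistent with the $k$-marginals. All function names and the number of sweeps are illustrative choices.

```python
from itertools import combinations, product
from math import log2


def marginal(p, idx):
    """Marginal of p over the positions in idx."""
    out = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out


def entropy(p):
    """Shannon entropy in bits of a distribution given as a dictionary."""
    return -sum(w * log2(w) for w in p.values() if w > 0)


def maxent_entropy(p, alphabets, k, sweeps=50):
    """Entropy H^(k) of the max-entropy p.d.f. matching all k-marginals of p.

    Assumption: iterative proportional fitting from the uniform p.d.f.
    """
    n = len(alphabets)
    states = list(product(*alphabets))
    q = {x: 1.0 / len(states) for x in states}
    for _ in range(sweeps):
        for idx in combinations(range(n), k):
            target, current = marginal(p, idx), marginal(q, idx)
            for x in states:
                key = tuple(x[i] for i in idx)
                if current[key] > 0:
                    q[x] *= target.get(key, 0.0) / current[key]
    return entropy(q)


def tc_layers(p, alphabets):
    """List [dH^(2), ..., dH^(N)] with dH^(k) = H^(k-1) - H^(k), as in (4)."""
    n = len(alphabets)
    h = [maxent_entropy(p, alphabets, k) for k in range(1, n + 1)]
    return [h[k - 1] - h[k] for k in range(1, n)]


bits = [(0, 1)] * 3
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
copy = {(a, a, a): 0.5 for a in (0, 1)}

layers_xor = tc_layers(xor, bits)
layers_copy = tc_layers(copy, bits)
print(layers_xor)   # TC of the XOR system, split by order
print(layers_copy)  # TC of three copies of one fair bit, split by order
```

For the XOR system the whole TC is located in the third-order layer, whereas for three copies of a fair bit it is located entirely in the pairwise layer.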

It has been observed that in many practical scenarios most of the TC of the measured data
is provided by the lower marginals. It can be shown that the fraction of the TC that is lost by
considering only the $k_0$-order marginals is given by

$$\frac{\mathrm{TC} - \sum_{k=2}^{k_0} \Delta H^{(k)}}{\mathrm{TC}} = \frac{1}{\mathrm{TC}} \sum_{k=k_0+1}^{N} \Delta H^{(k)}. \qquad (5)$$

This quantity is small if there exists a value of $k_0$ such that $\tilde{p}^{(k_0)}_{\mathbf{X}}$ provides an accurate
approximation for the joint p.d.f. of the system. Interestingly, it has been shown that pairwise
maximum entropy models (i.e. $k_0 = 2$) can provide an accurate description of the statistics of
many biological systems [18]-[21] and also some social organizations [22], [23].


B. Internal and external decompositions 

An alternative approach to study the interdependencies between many random variables is to 
analyze the ways in which they share information. This can be done by decomposing the joint 
entropy of the system. For the case of two variables, the joint entropy can be decomposed as 

$$H(X_1,X_2) = I(X_1;X_2) + H(X_1|X_2) + H(X_2|X_1), \qquad (6)$$

suggesting that it can be divided into shared information, $I(X_1;X_2)$, and into terms which
represent information that is exclusively located in a single variable, i.e., $H(X_1|X_2)$ for $X_1$ and
$H(X_2|X_1)$ for $X_2$.

In systems with more than two variables, one can compute the total information that is
exclusively located in one variable as $H_{(1)} := \sum_j H(X_j|\mathbf{X}_j^c)$,¹ where $\mathbf{X}_j^c$ denotes all the system's
variables except $X_j$. The difference between the joint entropy and the sum of all exclusive
information terms, $H_{(1)}$, defines a quantity known [24] as the dual total correlation (DTC)²:

$$\mathrm{DTC} = H(\mathbf{X}) - H_{(1)}, \qquad (7)$$

which measures the portion of the joint entropy that is shared between two or more variables of
the system. When $N = 2$ then $\mathrm{DTC} = I(X_1;X_2)$, and hence the DTC has also been suggested
in the literature as a measure for the multivariate mutual information.

¹ The superscripts and subscripts are used to reflect that $H^{(1)} \geq H(\mathbf{X}) \geq H_{(1)}$.

² The DTC is also known as excess entropy in [25], whose definition differs from its typical use in the context of time series,
e.g. [26].

By comparing (4) and (7), it would be appealing to look for a decomposition of the DTC of the
form $\mathrm{DTC} = \sum_{k=2}^{N} \Delta H_{(k)}$, where $\Delta H_{(k)} \geq 0$ would measure the information that is shared by
exactly $k$ variables [27]. With this, one could define an internal entropy $H_{(j)} = H_{(1)} + \sum_{i=2}^{j} \Delta H_{(i)}$
as the information that is shared between at most $j$ variables, in contrast to the external entropy
$H^{(j)} = H^{(1)} - \sum_{i=2}^{j} \Delta H^{(i)}$, which describes the information provided by the $j$-marginals. These
entropies form a non-decreasing sequence:

$$H_{(1)} \leq \dots \leq H_{(N-1)} \leq H(\mathbf{X}) \leq H^{(N-1)} \leq \dots \leq H^{(1)}. \qquad (8)$$

This layered structure, and its relationship with the TC and the DTC, is graphically represented
in Figure 1.


Fig. 1. Layers of internal and external entropies that decompose the DTC and the TC. Each $\Delta H^{(j)}$ shows how much information
is contained in the $j$-marginals, while each $\Delta H_{(j)}$ measures the information that is shared between exactly $j$ variables.


It is interesting to note that even though the TC and DTC coincide for the case of $N = 2$, these
quantities are in general different for larger system sizes. Therefore, in general $\Delta H_{(k)} \neq \Delta H^{(k)}$,
although it is appealing to believe that there should exist a relationship between them. One of
the goals of this paper is to explore the difference between these quantities.

C. Inclusion-exclusion decompositions 

Perhaps the most natural approach to decompose the DTC and joint entropy is to apply the
inclusion-exclusion principle, using a simplifying analogy that the entropies and areas have
similar properties. Refined versions of this approach can be found in the I-measures [28]
and in the multi-scale complexity [29]. For the case of three variables, this approach
gives

$$\mathrm{DTC}_{N=3} = I(X_1;X_2|X_3) + I(X_2;X_3|X_1) + I(X_3;X_1|X_2) + I(X_1;X_2;X_3). \qquad (9)$$

The last term is known as the co-information [30] (being closely related to the interaction
information [31]), and can be defined using the inclusion-exclusion principle as

$$I(X_1;X_2;X_3) := H(X_1) + H(X_2) + H(X_3) - H(X_1,X_2) - H(X_2,X_3) - H(X_1,X_3) + H(X_1,X_2,X_3) \qquad (10)$$

$$= I(X_1;X_2) - I(X_1;X_2|X_3). \qquad (11)$$

As $I(X_1;X_2;X_2) = I(X_1;X_2)$, the co-information has also been proposed as a candidate for
extending the mutual information to multiple variables. For a summary of the various possible
extensions of the mutual information, see Table I and also additional discussion in Ref. [32].
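Identities (9)-(11) are easy to verify numerically on small examples. The following sketch assumes three discrete variables stored as a dictionary from outcome triples to probabilities; the function names are illustrative.

```python
from itertools import product
from math import log2


def H(p, idx):
    """Entropy in bits of the variables at the positions in idx."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(w * log2(w) for w in marg.values() if w > 0)


def cond_mi(p, i, j, k):
    """I(X_i; X_j | X_k) = H(X_i,X_k) + H(X_j,X_k) - H(X_i,X_j,X_k) - H(X_k)."""
    return H(p, [i, k]) + H(p, [j, k]) - H(p, [i, j, k]) - H(p, [k])


def co_information(p):
    """Inclusion-exclusion formula (10) for three variables."""
    singles = H(p, [0]) + H(p, [1]) + H(p, [2])
    pairs = H(p, [0, 1]) + H(p, [1, 2]) + H(p, [0, 2])
    return singles - pairs + H(p, [0, 1, 2])


def dtc(p):
    """Dual total correlation (7): joint entropy minus exclusive information."""
    joint = H(p, [0, 1, 2])
    exclusive = sum(joint - H(p, [a, b]) for a, b in ((1, 2), (0, 2), (0, 1)))
    return joint - exclusive


def dtc_from_terms(p):
    """Right-hand side of (9)."""
    cmis = cond_mi(p, 0, 1, 2) + cond_mi(p, 1, 2, 0) + cond_mi(p, 2, 0, 1)
    return cmis + co_information(p)


xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
copy = {(a, a, a): 0.5 for a in (0, 1)}

for name, p in (("xor", xor), ("copy", copy)):
    print(name, co_information(p), dtc(p), dtc_from_terms(p))
```

The XOR system yields a negative co-information, while three copies of a fair bit yield a positive one; in both cases the two ways of computing the DTC agree.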

TABLE I

Summary of the candidates for extending the mutual information for $N \geq 3$.

Name                      Formula
Total correlation         $\mathrm{TC} = \sum_j H(X_j) - H(\mathbf{X})$
Dual total correlation    $\mathrm{DTC} = H(\mathbf{X}) - \sum_j H(X_j|\mathbf{X}_j^c)$
Co-information            $I(X_1;X_2;X_3) = I(X_1;X_2) - I(X_1;X_2|X_3)$


It is tempting to coarsen the decomposition provided by this approach in order to build a
decomposition for the DTC. In this decomposition, the co-information is associated with $\Delta H_{(3)}$, and
the remaining terms of (9) are associated with $\Delta H_{(2)}$. With this, one can build a Venn diagram
for the information sharing between three variables, as in Figure 2. However, the resulting
decomposition and diagram are not very intuitive since the co-information can be negative.

Fig. 2. An approach based on the I-measures decomposes the total entropy of three variables $H(X, Y, Z)$ into 7 signed areas.

As part of this temptation, it is appealing to consider the conditional mutual information
$I(X_1;X_2|X_3)$ as the information contained in $X_1$ and $X_2$ that is not contained in $X_3$, just as the
conditional entropy $H(X_1|X_2)$ is the information that is in $X_1$ and not in $X_2$. However, the latter
interpretation works because conditioning never increases entropy (i.e., $H(X_1) \geq H(X_1|X_2)$),
while this is not true for mutual information; that is, in some cases the conditional mutual
information $I(X_1;X_2|X_3)$ can be greater than $I(X_1;X_2)$. This suggests that the conditional
mutual information can capture information that extends beyond $X_1$ and $X_2$, incorporating
higher-order effects with respect to $X_3$. Therefore, a better understanding of the conditional
mutual information is required in order to refine the decomposition suggested by (9).

D. Synergistic information 

An extended treatment of the conditional mutual information and its relationship with the
mutual information decomposition can be found in [33], [34]. For presenting these ideas, let us
consider two random variables $X_1$ and $X_2$ which are used to predict $Y$. The total predictability³,
i.e., the part of the randomness of $Y$ that can be predicted by $X_1$ and $X_2$, can be expressed
using the chain rule of the mutual information as⁴

$$I(X_1X_2;Y) = I(X_1;Y) + I(X_2;Y|X_1). \qquad (12)$$

It is natural to think that the predictability provided by $X_1$, which is given by the term $I(X_1;Y)$,
can be either unique or redundant with respect to the information provided by $X_2$. On the other
hand, due to (12) it is clear that the unique predictability contributed by $X_2$ must be contained
in $I(X_2;Y|X_1)$. However, the fact that $I(X_2;Y|X_1)$ can be larger than $I(X_2;Y)$, while the
latter contains both the unique and redundant contributions of $X_2$, suggests that there can be
an additional predictability that is accounted for only by the conditional mutual information.

Following this rationale, we denote as synergistic predictability the part of the conditional
mutual information that corresponds to evidence about the target that is not contained in any
single predictor, but is only revealed when both are known. As an example of this, consider
again the case in which $X_1$ and $X_2$ are independent random bits and $Y = X_1 \oplus X_2$. Then, it
can be seen that $I(X_1;Y) = I(X_2;Y) = 0$ but $I(X_1X_2;Y) = I(X_1;Y|X_2) = 1$. Hence, neither
$X_1$ nor $X_2$ individually provides information about $Y$, although together they fully determine it.

Further discussions about the notion of information synergy can be found in [11], [35]-[37].

³ Note that the term total predictability has also been used in [26] with a definition that differs from our current usage.

⁴ For simplicity, throughout the paper we use the shorthand notation $XY = (X, Y)$.

III. A non-negative joint entropy decomposition

Following the discussion presented in Section II-B, we search for a decomposition of the joint 
entropy that reflects the private, common and synergistic modes of information sharing. In this 
way, we want the decomposition to distinguish information that is shared only by a few variables
from information that is accessible only from the entire system.

Our framework is based on distinguishing the directed notion of predictability from the 
undirected one of information. It is to be noted that there is an ongoing debate about the best way 
of characterizing and computing the predictability in arbitrary systems, as the commonly used 
axioms are not enough for specifying a unique formula that satisfies them [35]. Nevertheless, 
our approach is to explore how far one can reach based on an axiomatic approach. In this way, our
results are going to be consistent with any choice of formula that is consistent with the discussed
axioms.

In the following, Sections III-A, III-B and III-C discuss the basic features of predictability and
information. After these necessary preliminaries, Section III-D finally presents our joint entropy
decomposition for discrete and continuous variables.

A. Predictability axioms

Let us consider two variables $X_1$ and $X_2$ that are used to predict a target variable $Y := X_3$.
Intuitively, $I(X_1;Y)$ quantifies the predictability of $Y$ that is provided by $X_1$. In the following,
we want to find a function $\mathcal{R}(X_1X_2 \to Y)$ that measures the redundant predictability provided
by $X_1$ with respect to the predictability provided by $X_2$, and a function $\mathcal{U}(X_1 \to Y|X_2)$ that
measures the unique predictability that is provided by $X_1$ but not by $X_2$. Following [33], we
first determine a number of desired properties that these functions should have.

Definition A predictability decomposition is defined by the real-valued functions $\mathcal{R}(X_1X_2 \to Y)$
and $\mathcal{U}(X_1 \to Y|X_2)$ over the distributions of $(X_1,Y)$ and $(X_2,Y)$, which satisfy the following
axioms:

(1) Non-negativity: $\mathcal{R}(X_1X_2 \to Y),\ \mathcal{U}(X_1 \to Y|X_2) \geq 0$.

(2) $I(X_1;Y) = \mathcal{R}(X_1X_2 \to Y) + \mathcal{U}(X_1 \to Y|X_2)$.

(3) $I(X_1X_2;Y) \geq \mathcal{R}(X_1X_2 \to Y) + \mathcal{U}(X_1 \to Y|X_2) + \mathcal{U}(X_2 \to Y|X_1)$.

(4) Weak symmetry I: $\mathcal{R}(X_1X_2 \to Y) = \mathcal{R}(X_2X_1 \to Y)$.

Above, Axiom (3) states that the sum of the redundant and corresponding unique predictabilities
given by each variable cannot be larger than the total predictability⁵. Axiom (4) states that the
redundancy is independent of the ordering of the predictors. The following Lemma determines
the bounds for the redundant predictability (the proof is given in Appendix A).

Lemma 1: The functions $\mathcal{R}(X_1X_2 \to Y)$ and $\mathcal{U}(X_1 \to Y|X_2) = I(X_1;Y) - \mathcal{R}(X_1X_2 \to Y)$
satisfy Axioms (1)-(3) if and only if

$$\min\{I(X_1;Y), I(X_2;Y)\} \geq \mathcal{R}(X_1X_2 \to Y) \geq [I(X_1;X_2;Y)]^+, \qquad (13)$$

where $[a]^+ = \max\{a, 0\}$.

Corollary 2: There always exists at least one predictability decomposition that satisfies Axioms
(1)-(4), which is given by

$$\mathcal{R}(X_1X_2 \to Y) := \min\{I(X_1;Y), I(X_2;Y)\}. \qquad (14)$$

Proof: Being a symmetric function on $X_1$ and $X_2$, (14) satisfies Axiom (4). Also, as (14)
is equal to the upper bound given in (13), Axioms (1)-(3) are satisfied due to Lemma 1. ■

⁵ In fact, the difference between the left- and right-hand sides of Axiom (3) gives the synergistic predictability, whose analysis
will not be included in this work.

In principle, the notion of redundant predictability takes the point of view of the target variable
and measures the parts that can be predicted by both $X_1$ and $X_2$ when they are used by
themselves, i.e., without combining them with each other. It is appealing to think that there
should exist a unique function that provides such a measure. Nevertheless, these axioms define
only very basic properties that a measure of redundant predictability should satisfy, and hence
in general they are not enough for defining a unique function. In fact, a number of different
predictability decompositions have been proposed in the literature [35], [36], [38], [39].

It is to be noted that, among all the candidates that are compatible with the Axioms, the
decomposition given in Corollary 2 gives the largest possible redundant predictability measure.
It is clear that in some cases this measure gives an over-estimate of the redundant predictability
given by $X_1$ and $X_2$; for an example of this consider $X_1$ and $X_2$ to be independent variables and
$Y = (X_1,X_2)$. Nevertheless, (14) has been proposed as an adequate measure for the redundant
predictability of multivariate Gaussians [39] (for a corresponding discussion see Section VI).
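The over-estimation mentioned above can be made concrete by evaluating the bounds (13) for $Y = (X_1,X_2)$ with independent fair bits $X_1$ and $X_2$. This is a minimal sketch under that assumption; the function names are illustrative and the co-information is computed through the identity $I(X_1;X_2;Y) = I(X_1;Y) + I(X_2;Y) - I(X_1X_2;Y)$, which follows from (11) and (12).

```python
from itertools import product
from math import log2


def H(p, idx):
    """Entropy in bits of the variables at the positions in idx."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(w * log2(w) for w in marg.values() if w > 0)


def mi(p, a, b):
    """I(A; B) for two disjoint groups of positions a and b."""
    return H(p, a) + H(p, b) - H(p, a + b)


def redundancy_bounds(p):
    """Lower and upper bounds (13) on R(X1 X2 -> Y) for a p.d.f. over (x1, x2, y)."""
    i1, i2 = mi(p, [0], [2]), mi(p, [1], [2])
    co_info = i1 + i2 - mi(p, [0, 1], [2])
    return max(co_info, 0.0), min(i1, i2)


# Y = (X1, X2), with X1 and X2 independent fair bits.
p = {(a, b, (a, b)): 0.25 for a, b in product((0, 1), repeat=2)}
lower, upper = redundancy_bounds(p)

print(lower, upper)
```

Here the admissible redundancy ranges from the lower bound up to the value chosen by (14), even though the two predictors are independent and describe different components of the target.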

B. Shared, private and synergistic information 

Let us now introduce an additional axiom, which will form the basis for our proposed 
information decomposition. 

Definition A symmetrical information decomposition is given by the real-valued functions
$I_\cap(X_1;X_2;X_3)$ and $I_{\text{priv}}(X_1;X_2|X_3)$ over the marginal distributions of $(X_1,X_2)$, $(X_1,X_3)$ and
$(X_2,X_3)$, which satisfy Axioms (1)-(4) for $I_\cap(X_1;X_2;X_3) := \mathcal{R}(X_1X_2 \to X_3)$ and
$I_{\text{priv}}(X_1;X_2|X_3) := \mathcal{U}(X_1 \to X_2|X_3)$, while also satisfying the following property:

(5) Weak symmetry II: $I_{\text{priv}}(X_1;X_2|X_3) = I_{\text{priv}}(X_2;X_1|X_3)$.

Finally, $I_S(X_1;X_2;X_3)$ is defined as $I_S(X_1;X_2;X_3) := I(X_1;X_2|X_3) - I_{\text{priv}}(X_1;X_2|X_3)$.

The role of Axiom (5) can be related to the role of the fifth of Euclid's postulates: while
seemingly innocuous, its addition has strong consequences for the corresponding theory. The
following Lemma explains why this decomposition is denoted as symmetrical, and also shows
fundamental bounds for these information functions (the proof is presented in Appendix C).

Lemma 3: The functions that compose a symmetrical information decomposition satisfy the
following properties:

(a) Strong symmetry: $I_\cap(X_1;X_2;X_3)$ and $I_S(X_1;X_2;X_3)$ are symmetric on their three arguments.

(b) Bounds: these quantities satisfy the following inequalities:

$$\min\{I(X_1;X_2), I(X_2;X_3), I(X_3;X_1)\} \geq I_\cap(X_1;X_2;X_3) \geq [I(X_1;X_2;X_3)]^+ \qquad (15)$$

$$\min\{I(X_1;X_3), I(X_1;X_3|X_2)\} \geq I_{\text{priv}}(X_1;X_3|X_2) \geq 0 \qquad (16)$$

$$\min\{I(X_1;X_2|X_3), I(X_2;X_3|X_1), I(X_3;X_1|X_2)\} \geq I_S(X_1;X_2;X_3) \geq [-I(X_1;X_2;X_3)]^+ \qquad (17)$$

Note that the defined functions can be used to decompose the following mutual information:

$$I(X_1X_2;X_3) = I(X_1;X_3) + I(X_2;X_3|X_1) \qquad (18)$$

$$I(X_1;X_3) = I_\cap(X_1;X_2;X_3) + I_{\text{priv}}(X_1;X_3|X_2) \qquad (19)$$

$$I(X_2;X_3|X_1) = I_{\text{priv}}(X_2;X_3|X_1) + I_S(X_1;X_2;X_3) \qquad (20)$$

In contrast to a decomposition based on the predictability, these measures address properties
of the system $(X_1,X_2,X_3)$ as a whole, without being dependent on how it is divided between
target and predictor variables (for a parallelism with respect to the corresponding predictability
measures, see Table II). Intuitively, $I_\cap(X_1;X_2;X_3)$ measures the shared information that is
common to $X_1$, $X_2$ and $X_3$; $I_{\text{priv}}(X_1;X_3|X_2)$ quantifies the private information that is shared
by $X_1$ and $X_3$ but not $X_2$, and $I_S(X_1;X_2;X_3)$ captures the synergistic information that exists
between $(X_1,X_2,X_3)$. The latter is a non-intuitive mode of information sharing, whose nature
we hope to clarify through the analysis of particular cases presented in Sections IV and VI.


TABLE II

Parallelism between predictability and information measures.

Directed measures                                          Symmetrical measures
Redundant predictability $\mathcal{R}(X_1X_2 \to X_3)$     Shared information $I_\cap(X_1;X_2;X_3)$
Unique predictability $\mathcal{U}(X_1 \to X_2|X_3)$       Private information $I_{\text{priv}}(X_1;X_2|X_3)$
Synergistic predictability                                 Synergistic information $I_S(X_1;X_2;X_3)$

Note also that the co-information can be expressed as

$$I(X_1;X_2;X_3) = I_\cap(X_1;X_2;X_3) - I_S(X_1;X_2;X_3). \qquad (21)$$

Hence, a strictly positive (resp. negative) co-information is a sufficient, although not necessary,
condition for the system to have a non-zero shared (resp. synergistic) information.

C. Further properties of the symmetrical decomposition 

At this point, it is important to clarify a fundamental distinction that we make between the
notions of predictability and information. The predictability is intrinsically a directed notion,
which is based on a distinction between predictors and the target variable. On the contrary, we
use the term information to exclusively refer to intrinsic statistical properties of the whole system
which do not rely on such a distinction. The main difference between the two notions is that, in
principle, the predictability only considers the predictable parts of the target, while the shared
information also considers the joint statistics of the predictors. Although this distinction will be
further developed when we address the case of Gaussian variables (cf. Section VI-C), let us for
now present a simple example to help develop intuitions about this issue.

Example Define the following functions:

$$I_\cap(X_1;X_2;X_3) = \min\{I(X_1;X_2), I(X_2;X_3), I(X_3;X_1)\}, \qquad (22)$$

$$I_{\text{priv}}(X_1;X_2|X_3) = I(X_1;X_2) - I_\cap(X_1;X_2;X_3). \qquad (23)$$

It is straightforward to verify that these functions satisfy Axioms (1)-(5), and therefore constitute a
symmetric information decomposition. In contrast to the decomposition given in Corollary 2,
this can be seen to be strongly symmetric and also dependent on the three marginals $(X_1,X_2)$,
$(X_2,X_3)$ and $(X_1,X_3)$.

In the following Lemma, whose simple proof is omitted, we generalize the previous construction.

Lemma 4: For a given predictability decomposition with functions $\mathcal{R}(X_1X_2 \to X_3)$ and
$\mathcal{U}(X_1 \to X_2|X_3)$, the functions

$$I_\cap(X_1;X_2;X_3) = \min\{\mathcal{R}(X_1X_2 \to X_3), \mathcal{R}(X_2X_3 \to X_1), \mathcal{R}(X_3X_1 \to X_2)\} \qquad (24)$$

$$I_{\text{priv}}(X_1;X_2|X_3) = I(X_1;X_2) - I_\cap(X_1;X_2;X_3) \qquad (25)$$

provide a symmetrical information decomposition, which is called the canonical symmetrization
of the predictability.

Corollary 5: There always exists at least one symmetric information decomposition. 

Proof: This is a direct consequence of the previous Lemma and Corollary 2. ■ 

Perhaps the most remarkable property of symmetrized information decompositions is that, in
contrast to directed ones, they are uniquely determined by Axioms (l)-(5) for a number of 
interesting cases. 

Theorem 6: The symmetric information decomposition is unique if the variables form a
Markov chain or two of them are pairwise independent.

Proof: Let us consider the upper and lower bounds for $I_\cap$ given in (15), denoting them as
$c_1 := [I(X_1;X_2;X_3)]^+$ and $c_2 := \min\{I(X_1;X_2), I(X_2;X_3), I(X_1;X_3)\}$. These bounds restrict
the possible $I_\cap$ functions to lie in the interval $[c_1, c_2]$ of length

$$\begin{aligned} |c_2 - c_1| = \min\{ & I(X_1;X_2), I(X_2;X_3), I(X_1;X_3), & (26) \\ & I(X_1;X_2|X_3), I(X_2;X_3|X_1), I(X_3;X_1|X_2)\}. & (27) \end{aligned}$$

Therefore, the framework will provide a unique expression for the shared information if (at least)
one of the above six terms is zero. These scenarios correspond either to Markov chains, where
one conditional mutual information term is zero, or pairwise independent variables where one
mutual information term vanishes. ■

Pairwise independent variables and Markov chains are analyzed in Sections IV and V-A,
respectively.
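The uniqueness condition of Theorem 6 can be inspected numerically by computing the interval $[c_1, c_2]$ from (15). The sketch below assumes three discrete variables; the binary Markov chain with flip probability 0.1 is a hypothetical example, and the function names are illustrative.

```python
from itertools import product
from math import log2


def H(p, idx):
    """Entropy in bits of the variables at the positions in idx."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(w * log2(w) for w in marg.values() if w > 0)


def mi(p, i, j):
    """Mutual information I(X_i; X_j)."""
    return H(p, [i]) + H(p, [j]) - H(p, [i, j])


def cond_mi(p, i, j, k):
    """Conditional mutual information I(X_i; X_j | X_k)."""
    return H(p, [i, k]) + H(p, [j, k]) - H(p, [i, j, k]) - H(p, [k])


def shared_info_interval(p):
    """Interval [c1, c2] from (15) that must contain the shared information."""
    co_info = mi(p, 0, 1) - cond_mi(p, 0, 1, 2)
    return max(co_info, 0.0), min(mi(p, 0, 1), mi(p, 1, 2), mi(p, 0, 2))


def six_term_min(p):
    """Right-hand side of (26)-(27)."""
    return min(mi(p, 0, 1), mi(p, 1, 2), mi(p, 0, 2),
               cond_mi(p, 0, 1, 2), cond_mi(p, 1, 2, 0), cond_mi(p, 2, 0, 1))


def flip(eps, a, b):
    """Transition probability of a binary symmetric channel."""
    return 1 - eps if a == b else eps


# Pairwise independent case: X3 = X1 XOR X2.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
# Hypothetical Markov chain X1 -> X2 -> X3 with flip probability 0.1 per step.
chain = {(a, b, c): 0.5 * flip(0.1, a, b) * flip(0.1, b, c)
         for a, b, c in product((0, 1), repeat=3)}

for name, p in (("xor", xor), ("chain", chain)):
    c1, c2 = shared_info_interval(p)
    print(name, c1, c2, six_term_min(p))
```

In both cases the interval collapses to a single point up to floating-point error, so the shared information is fixed by the axioms: it vanishes for the XOR system and equals $I(X_1;X_3)$ for the chain.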

D. Decomposition for the joint entropy of three variables 

Now we use the notions of redundant, private and synergistic information functions for 
developing a non-negative decomposition of the joint entropy, which is based on a non-negative 
decomposition of the DTC. For the case of three discrete variables, by applying (20) and (21) 
to (9), one finds that 

DTC = /priv(Xi;X2|X3) + Jpriv(X2;X3|Xi) + XfX^) 


+ /n(Xi;X2;X3) + 2/s(Xi;X2;X3) . 


( 28 ) 
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From (7) and (28), one can propose the following decomposition for the joint entropy: 


i7(Xi, X 2 , Xs) = 77(1) + A77(2) + A77(3). (29) 

where 

77(1) = 77(Xi|X2,X3) + 77(X2|Xi,X3) + 77(X3|Xi,X2) (30) 

A77(2) = 7pn,(Xi; X 2 IX 3 ) + 7priv(X2; X 3 IX 1 ) + 7priv(A3; A 1 IX 2 ) (31) 

A77(3) = 7n(Xi;X2;X3) + 27 s(Xi;X2;X3) (32) 


In contrast to (9), here each term is non-negative because of Lemma 3*. Therefore, (29) yields a
non-negative decomposition of the joint entropy, where each of the corresponding terms captures
the information that is shared by one, two or three variables. Interestingly, $H_{(1)}$ and $\Delta H_{(2)}$ are
homogeneous (being the sum of all the exclusive information or private information of the
system) while $\Delta H_{(3)}$ is composed of a mixture of two different information sharing modes.

An analogous decomposition can be developed for the case of continuous random variables.
Nevertheless, as the differential entropy can be negative, not all the terms of the decomposition
can be non-negative. In effect, following the same rationale that led to (29), the following
decomposition can be found:

$$ h(X_1,X_2,X_3) = h_{(1)} + \Delta H_{(2)} + \Delta H_{(3)} . \tag{33} $$

Above, $h(X)$ denotes the differential entropy of $X$, $\Delta H_{(2)}$ and $\Delta H_{(3)}$ are as defined in (31) and
(32), and

$$ h_{(1)} = h(X_1|X_2X_3) + h(X_2|X_1X_3) + h(X_3|X_1X_2) . \tag{34} $$

Hence, although both the joint entropy $h(X_1,X_2,X_3)$ and $h_{(1)}$ can be negative, the remaining
terms remain non-negative.

It can be seen that the lowest layer of the decomposition is always trivial to compute, and
hence the challenge is to find expressions for $\Delta H_{(2)}$ and $\Delta H_{(3)}$. In the rest of the paper, we
will explore scenarios where these quantities can be characterized.

*From (20), it can be seen that the co-information is sometimes negative in order to compensate for the triple
counting of the synergy due to the sum of the three conditional mutual information terms.

IV. Pairwise independent variables

In this section we focus on the case where two variables are pairwise independent while being
globally connected by a third variable. The fact that pairwise independent variables can become
correlated when additional information becomes available is known in the statistics literature as
Berkson's paradox or selection bias [40], or as the explaining away effect in the context of
artificial intelligence [41]. As an example of this phenomenon, consider $X_1$ and $X_2$ to be two
independent standard Gaussian variables, and $X_3$ a binary variable that is equal to 1
if $X_1 + X_2 > 0$ and zero otherwise. Then, knowing that $X_3 = 1$ implies that $X_2 > -X_1$, and
hence knowing the value of $X_1$ effectively reduces the uncertainty about $X_2$.

In our framework, Berkson's paradox can be understood as synergistic information that is
introduced by the third component of the system. In fact, we will show that in this case the
synergistic information function is unique and given by

$$ I_S(X_1;X_2;X_3) = \sum_{x_3} p_{X_3}(x_3)\, I(X_1;X_2|X_3 = x_3) = I(X_1;X_2|X_3) , \tag{35} $$

which is, in fact, a measure of the dependencies between $X_1$ and $X_2$ that are created by $X_3$.
In the following, Section IV-A presents the unique symmetrized information decomposition for
this case. Then, Section IV-B focuses on the particular case where $X_3$ is a function of the other
two variables.

A. Uniqueness of the entropy decomposition 

Let us assume that $X_1$ and $X_2$ are pairwise independent, and hence the joint p.d.f. of $X_1$, $X_2$
and $X_3$ has the following structure:

$$ p_{X_1X_2X_3}(x_1,x_2,x_3) = p_{X_1}(x_1)\, p_{X_2}(x_2)\, p_{X_3|X_1X_2}(x_3|x_1,x_2) . \tag{36} $$

It is direct to see that in this case $p_{X_1X_2} = \sum_{x_3} p_{X_1X_2X_3} = p_{X_1}p_{X_2}$, but
$p_{X_1X_2|X_3} \neq p_{X_1|X_3}\, p_{X_2|X_3}$ in general.

Therefore, as $I(X_1;X_2) = 0$, it follows directly from Axiom (1) that any redundant predictability
function satisfies $\mathcal{R}(X_1X_3 \to X_2) = \mathcal{R}(X_2X_3 \to X_1) = 0$. However, the axioms are not enough to
uniquely determine $\mathcal{R}(X_1X_2 \to X_3)$*. Nevertheless, the symmetrized decomposition is uniquely
determined, as shown in the next corollary, which is a consequence of Theorem 6.

*Note that in this case $I(X_1;X_2;X_3) = -I(X_1;X_2|X_3) \leq 0$, so the only restriction that the bound presented
in Lemma 3 provides is $\min\{I(X_1;X_3), I(X_2;X_3)\} \geq \mathcal{R}(X_1X_2 \to X_3) \geq 0$.

Corollary 7: If $X_1$, $X_2$ and $X_3$ follow a p.d.f. as in (36), then the shared, private and synergistic
information functions are unique. They are given by

\begin{align}
I_\cap(X_1;X_2;X_3) &= I_{\text{priv}}(X_1;X_2|X_3) = 0 , \tag{37} \\
I_{\text{priv}}(X_1;X_3|X_2) &= I(X_1;X_3) , \tag{38} \\
I_{\text{priv}}(X_2;X_3|X_1) &= I(X_2;X_3) , \tag{39} \\
I_S(X_1;X_2;X_3) &= I(X_1;X_2|X_3) = -I(X_1;X_2;X_3) . \tag{40}
\end{align}

Proof: The fact that there is no shared information follows directly from the upper bound
presented in Lemma 3. Using this, the expressions for the private information can be found using
Axiom (2). Finally, the synergistic information can be computed as

$$ I_S(X_1;X_2;X_3) = I(X_1;X_2|X_3) - I_{\text{priv}}(X_1;X_2|X_3) = I(X_1;X_2|X_3) . \tag{41} $$

The second formula for the synergistic information can then be found using the fact that
$I(X_1;X_2) = 0$. ■

With this corollary, the unique decomposition of the $\mathrm{DTC} = \Delta H_{(2)} + \Delta H_{(3)}$ can be found to
be

\begin{align}
\Delta H_{(2)} &= I(X_1;X_3) + I(X_2;X_3) , \tag{42} \\
\Delta H_{(3)} &= 2 I(X_1;X_2|X_3) . \tag{43}
\end{align}
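
The decomposition in (42)-(43) can be checked against (29) on a small example. The sketch below assumes two independent binary inputs and takes $X_3 = X_1$ OR $X_2$ as an arbitrary illustrative choice; the helper names and the input probabilities are assumptions made for this example only.

```python
import numpy as np

def H(p, axes):
    """Entropy in bits of the marginal of the joint pmf `p` over `axes`."""
    drop = tuple(a for a in range(p.ndim) if a not in axes)
    q = (p.sum(axis=drop) if drop else p).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def mi(p, a, b):
    """Mutual information I(X_a; X_b)."""
    return H(p, (a,)) + H(p, (b,)) - H(p, (a, b))

def cmi(p, a, b, c):
    """Conditional mutual information I(X_a; X_b | X_c)."""
    return H(p, (a, c)) + H(p, (b, c)) - H(p, (a, b, c)) - H(p, (c,))

def joint_from_function(p1, p2, f, n_out):
    """Joint pmf of (X1, X2, X3) for independent X1 ~ p1, X2 ~ p2 and X3 = f(X1, X2)."""
    p = np.zeros((len(p1), len(p2), n_out))
    for a, pa in enumerate(p1):
        for b, pb in enumerate(p2):
            p[a, b, f(a, b)] += pa * pb
    return p

# Assumed example: a biased and a fair bit combined by a logical OR.
p = joint_from_function([0.7, 0.3], [0.5, 0.5], lambda a, b: a | b, 2)

H_joint = H(p, (0, 1, 2))
# H_(1): sum of the three exclusive terms H(X_i | rest) = H(all) - H(rest), as in (30).
H1 = sum(H_joint - H(p, rest) for rest in [(1, 2), (0, 2), (0, 1)])
dH2 = mi(p, 0, 2) + mi(p, 1, 2)      # (42)
dH3 = 2 * cmi(p, 0, 1, 2)            # (43)

print("joint entropy      :", H_joint)
print("H1 + dH2 + dH3     :", H1 + dH2 + dH3)
```

The two printed values should agree up to floating-point error, since for pairwise independent inputs the three layers add up to the joint entropy.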

Note that the terms $\Delta H_{(2)}$ and $\Delta H_{(3)}$ can be bounded as follows:

\begin{align}
\Delta H_{(2)} &\leq \min\{H(X_1), H(X_3)\} + \min\{H(X_2), H(X_3)\} , \tag{44} \\
\Delta H_{(3)} &\leq 2 \min\{H(X_1|X_3), H(X_2|X_3)\} . \tag{45}
\end{align}

The bound for $\Delta H_{(2)}$ follows from the basic fact that $I(X;Y) \leq \min\{H(X), H(Y)\}$. The
second bound follows from

\begin{align}
I(X;Y|Z) &= \sum_z p_Z(z)\, I(X;Y|Z = z) \tag{46} \\
&\leq \sum_z p_Z(z) \min\{H(X|Z = z), H(Y|Z = z)\} \tag{47} \\
&\leq \min\Big\{ \sum_z p_Z(z) H(X|Z = z),\ \sum_z p_Z(z) H(Y|Z = z) \Big\} \tag{48} \\
&= \min\{H(X|Z), H(Y|Z)\} . \tag{49}
\end{align}

B. Functions of independent arguments

Let us focus in this section on the special case where $X_3 = F(X_1,X_2)$ is a function of
two independent random inputs, and study its corresponding entropy decomposition. We will
consider $X_1$ and $X_2$ as the inputs and $F(X_1,X_2)$ as the output. Although this scenario fits nicely
in the predictability framework, it can also be studied from the shared information framework’s
perspective. Our goal is to understand how $F$ affects the information sharing structure.

As $H(X_3|X_1,X_2) = 0$, we have

$$ H_{(1)} = H(X_1|X_2X_3) + H(X_2|X_1X_3) . \tag{50} $$

The term $H_{(1)}$ hence measures the information of the inputs that is not reflected by the output.
An extreme case is given by a constant function $F(X_1,X_2) = k$, for which $\Delta H_{(2)} = \Delta H_{(3)} = 0$.

The term $\Delta H_{(2)}$ measures how much of $F$ can be predicted with knowledge that comes from
one of the inputs but not from the other. If $\Delta H_{(2)}$ is large then $F$ is not “mixing” the inputs too
much, in the sense that each of them is by itself able to provide relevant information that is not
also given by the other. In fact, a maximal value of $\Delta H_{(2)}$ is given by $F(X_1,X_2) = (X_1,X_2)$,
where $H_{(1)} = \Delta H_{(3)} = 0$ and the bound provided in (44) is attained.

Finally, due to (43), there is no shared information and hence $\Delta H_{(3)}$ is just proportional to the
synergy of the system. By considering (45), one finds that $F$ needs to leave some ambiguity about
the exact values of the inputs in order for the system to possess synergy. For example, consider a
1-1 function $F$ for which for every output $F(X_1,X_2) = X_3$ one can find the unique values $X_1$ and
$X_2$ that generate it. Under this condition $H(X_1|X_3) = H(X_2|X_3) = 0$ and hence, because of (45),
it is clear that a 1-1 function does not induce synergy. At the other extreme, we showed already
that constant functions have $\Delta H_{(3)} = 0$, and hence the case where the output of the system
gives no information about the inputs also leads to no synergy. Therefore, synergistic functions
are those whose output values generate a balanced ambiguity about the generating inputs. To
develop this idea further, the next lemma studies the functions that generate a maximum amount
of synergy by generating for each output value different 1-1 mappings between their arguments.

Lemma 8: Let us assume that both $X_1$ and $X_2$ take values over $\mathcal{K} = \{0, \dots, K-1\}$ and
are independent. Then, the maximal possible amount of information synergy is created by the
function

$$ F^*(n,m) = n + m \pmod{K} \tag{51} $$

when both input variables are uniformly distributed.

Proof: Using the same rationale as in (49), it can be shown that if $F$ is an arbitrary
function then

\begin{align}
I_S(X_1;X_2;F(X_1,X_2)) &= I(X_1;X_2|F) \tag{52} \\
&\leq \min\{H(X_1|F), H(X_2|F)\} \tag{53} \\
&\leq \min\{H(X_1), H(X_2)\} \tag{54} \\
&\leq \log K , \tag{55}
\end{align}

where the last inequality follows from the fact that both inputs are restricted to alphabets of
size $K$.

Now, consider $F^*$ to be the function given in (51) and assume that $X_1$ and $X_2$ are uniformly
distributed. It can be seen that for each $z \in \mathcal{K}$ there exist exactly $K$ ordered pairs of inputs
$(x_1,x_2)$ such that $F^*(x_1,x_2) = z$, which define a bijection from $\mathcal{K}$ to $\mathcal{K}$. Therefore,

$$ I(X_1;X_2|F^* = z) = H(X_1|z) - H(X_1|X_2,z) = H(X_1) = \log K \tag{56} $$

and hence

$$ I_S(X_1;X_2;F^*) = I(X_1;X_2|F^*) = \sum_z P\{F^* = z\} \cdot I(X_1;X_2|F^* = z) = \log K , \tag{57} $$

showing that the upper bound presented in (55) is attained. ■

Corollary 9: The XOR logic gate generates the largest amount of synergistic information 
possible for the case of binary inputs. 
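
Lemma 8 and Corollary 9 can be verified by direct enumeration. The following sketch assumes independent, uniformly distributed inputs and uses (40) to compute the synergy as $I(X_1;X_2|F)$ in bits; the function `synergy_of` and the comparison function `min` are illustrative choices.

```python
import numpy as np

def H(p, axes):
    """Entropy in bits of the marginal of the joint pmf `p` over `axes`."""
    drop = tuple(a for a in range(p.ndim) if a not in axes)
    q = (p.sum(axis=drop) if drop else p).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def cmi(p, a, b, c):
    """Conditional mutual information I(X_a; X_b | X_c)."""
    return H(p, (a, c)) + H(p, (b, c)) - H(p, (a, b, c)) - H(p, (c,))

def synergy_of(f, K):
    """I(X1; X2 | F) in bits for independent uniform inputs on {0, ..., K-1}.

    Assumes that f maps every input pair into {0, ..., K-1}.
    """
    p = np.zeros((K, K, K))
    for n in range(K):
        for m in range(K):
            p[n, m, f(n, m)] += 1.0 / K**2
    return cmi(p, 0, 1, 2)

K = 4
s_add = synergy_of(lambda n, m: (n + m) % K, K)   # F* of (51)
s_min = synergy_of(lambda n, m: min(n, m), K)     # an arbitrary non-optimal function
s_xor = synergy_of(lambda n, m: n ^ m, 2)         # XOR gate, i.e. (51) with K = 2

print("modular sum, K=4:", s_add, "(upper bound log2 K =", np.log2(K), ")")
print("minimum,     K=4:", s_min)
print("XOR,         K=2:", s_xor)
```

The modular sum reaches the bound $\log K$ of (55), while a function such as the minimum, whose output is informative about each input, stays strictly below it.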

The synergistic nature of the addition over finite fields helps to explain the central role it 
has in various fields. In cryptography, the one-time-pad [42] is an encryption technique that 
uses finite-field additions for creating a synergistic interdependency between a private message, 
a public signal and a secret key. This interdependency is completely destroyed when the key 
is not known, ensuring no information leakage to unintended receivers [43]. Also, in network 
coding [44], [45], nodes in the network use linear combinations of their received data packets to 
create and transmit synergistic combinations of the corresponding information messages. This
technique has been shown to achieve the multicast capacity in wired communication networks
[45] and has also been used to increase the throughput of wireless systems [46]. 

V. Discrete pairwise maximum entropy distributions and Markov chains 

This section studies the case where the system’s variables follow a pairwise maximum entropy 
(PME) distribution. These distributions are of great importance in statistical physics and machine 
learning communities, where they are studied under the names of Gibbs distributions [47] or 
Markov random fields [48]. 

Concretely, let us consider three pairwise marginal distributions $p_{X_1X_2}$, $p_{X_2X_3}$ and $p_{X_1X_3}$ for
the discrete variables $X_1$, $X_2$ and $X_3$. Let us denote as $Q$ the set of all the joint p.d.f.s over
$(X_1,X_2,X_3)$ that have those as their pairwise marginal distributions. Then, the corresponding
PME distribution is given by the joint p.d.f. $p_X(x_1,x_2,x_3)$ that satisfies

$$ p_X = \arg\max_{p \in Q} H(\{p\}) . \tag{58} $$

For the case of binary variables (i.e. $X_j \in \{0,1\}$), the PME distribution is given by an Ising
distribution [49]:

$$ p_X(X) = \frac{e^{-E(X)}}{Z} , \tag{59} $$

where $Z$ is a normalization constant and $E(X)$ an energy function given by
$E(X) = \sum_i J_i X_i + \sum_j \sum_{k \neq j} J_{j,k} X_j X_k$, with $J_{j,k}$ being the coupling terms. In effect, if
$J_{i,k} = 0$ for all $i$ and $k$, then $p_X(X)$ can be factorized as the product of the unary-marginal p.d.f.s.

In the context of the framework discussed in Section II-A, a PME system has $\mathrm{TC} = \Delta H^{(2)}$
while $\Delta H^{(3)} = 0$. In contrast, Section V-A studies these systems under the light of the
decomposition of the DTC presented in Section III-D. Then, Section V-B specifies the analysis for the
particular case of Markov chains.

A. Synergy minimization 

It is tempting to associate the synergistic information with that which is only in the joint p.d.f.
but not in the pairwise marginals, i.e. with $\Delta H^{(3)}$. However, the following result states that there
can exist some synergy defined by the pairwise marginals themselves.

Theorem 10: PME distributions have the minimum amount of synergistic information that is
allowed by their pairwise marginals.

Proof: Note that

\begin{align}
\max_{p \in Q} H(X_1X_2X_3) &= H(X_1X_2) + H(X_3) - \min_{p \in Q} I(X_1X_2;X_3) \tag{60} \\
&= H(X_1X_2) + H(X_3) - I(X_1;X_3) - \min_{p \in Q} I(X_2;X_3|X_1) \tag{61} \\
&= H(X_1X_2) + H(X_3) - I(X_1;X_3) - I_{\text{priv}}(X_2;X_3|X_1) - \min_{p \in Q} I_S(X_1;X_2;X_3) . \tag{62}
\end{align}

Therefore, maximizing the joint entropy for fixed pairwise marginals is equivalent to minimizing
the synergistic information. Note that the last equality follows from the fact that $I_{\text{priv}}(X_2;X_3|X_1)$
by definition only depends on the pairwise marginals. ■

Corollary 11: For an arbitrary system $(X_1,X_2,X_3)$, the synergistic information can be
decomposed as

$$ I_S(X_1;X_2;X_3) = I_S^{\text{PME}} + \Delta H^{(3)} , \tag{63} $$

where $\Delta H^{(3)}$ is as defined in (4) and $I_S^{\text{PME}} = \min_{p \in Q} I_S(X_1;X_2;X_3)$ is the synergistic
information of the corresponding PME distribution.

Proof: For an arbitrary p.d.f. $p_{X_1X_2X_3}$, it can be seen that

\begin{align}
\Delta H^{(3)} &= \max_{p \in Q} H(X_1X_2X_3) - H(\{p_{X_1X_2X_3}\}) \tag{64} \\
&= I_S(\{p_{X_1X_2X_3}\}) - \min_{p \in Q} I_S(X_1;X_2;X_3) . \tag{65}
\end{align}

Above, the first equality corresponds to the definition of $\Delta H^{(3)}$ and the second equality comes
from using (62) on each joint entropy term and noting that only the synergistic information
depends on more than the pairwise marginals. ■

The previous corollary shows that $\Delta H^{(3)}$ measures only one part of the information synergy
of a system, the part that can be removed without altering the pairwise marginals. Note that
PME systems with non-zero synergy are easy to find. For example, consider $X_1$ and $X_2$ to
be two independent equiprobable bits, and $X_3 = X_1$ AND $X_2$. It can be shown that for this case
one has $\Delta H^{(3)} = 0$ [16]. On the other hand, as the inputs are independent the synergy can be
computed using (40), and therefore a direct calculation (in bits) shows that

$$ I_S(X_1;X_2;X_3) = I(X_1;X_2|X_3) = H(X_1|X_3) - H(X_1|X_2X_3) = 0.1887 . \tag{66} $$
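
The value in (66) can be reproduced by enumerating the four equiprobable input pairs of the AND gate. This is a minimal sketch in which the joint pmf is stored as a 2x2x2 array and entropies are measured in bits; the helper names are illustrative.

```python
import numpy as np

def H(p, axes):
    """Entropy in bits of the marginal of the joint pmf `p` over `axes`."""
    drop = tuple(a for a in range(p.ndim) if a not in axes)
    q = (p.sum(axis=drop) if drop else p).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

# Joint pmf of (X1, X2, X3) with X3 = X1 AND X2 and independent fair input bits.
p = np.zeros((2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        p[x1, x2, x1 & x2] = 0.25

h_x1_given_x3 = H(p, (0, 2)) - H(p, (2,))            # H(X1 | X3)
h_x1_given_x2x3 = H(p, (0, 1, 2)) - H(p, (1, 2))     # H(X1 | X2, X3)
synergy = h_x1_given_x3 - h_x1_given_x2x3            # I(X1; X2 | X3), as in (66)

print(round(synergy, 4))  # 0.1887
```

Here $H(X_1|X_3) \approx 0.6887$ and $H(X_1|X_2X_3) = 0.5$, which gives the stated difference.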


From the previous discussion, one can conclude that only a special class of pairwise
distributions $p_{X_1X_2}$, $p_{X_1X_3}$ and $p_{X_2X_3}$ is compatible with having null synergistic information in the
system. This is a remarkable result, as the synergistic information is usually considered to be an
effect purely related to high-order marginals. It would be interesting to have an expression for
the minimal information synergy that a set of pairwise distributions requires, or equivalently, a
symmetrized information decomposition for PME distributions. A particular case that allows a
unique solution is discussed in the next section.


B. Markov chains 

Markov chains maximize the joint entropy subject to constraints on only two of the three
pairwise distributions. In effect, following the same rationale as in the proof of Theorem 10, it
can be shown that

$$ H(X_1,X_2,X_3) = H(X_1X_2) + H(X_3) - I(X_2;X_3) - I(X_1;X_3|X_2) . \tag{67} $$

Then, for fixed pairwise distributions $p_{X_1X_2}$ and $p_{X_2X_3}$, maximizing the joint entropy is equivalent
to minimizing the conditional mutual information. Moreover, the maximal entropy is attained
by the p.d.f. that makes $I(X_1;X_3|X_2) = 0$, which is precisely the Markov chain $X_1 - X_2 - X_3$
with joint distribution

$$ p_{X_1X_2X_3} = \frac{p_{X_1X_2}\, p_{X_2X_3}}{p_{X_2}} . \tag{68} $$

For the binary case, it can be shown that a Markov chain corresponds to an Ising distribution
like (59), where the interaction term $J_{1,3}$ is equal to zero.

Theorem 6 showed that the symmetric information decomposition for Markov chains is unique.
We develop this decomposition in the following corollary.

Corollary 12: If $X_1 - X_2 - X_3$ is a Markov chain, then their unique shared, private and
synergistic information functions are given by

\begin{align}
I_\cap(X_1;X_2;X_3) &= I(X_1;X_3) , \tag{69} \\
I_{\text{priv}}(X_1;X_2|X_3) &= I(X_1;X_2) - I(X_1;X_3) , \tag{70} \\
I_{\text{priv}}(X_2;X_3|X_1) &= I(X_2;X_3) - I(X_1;X_3) , \tag{71} \\
I_S(X_1;X_2;X_3) &= I_{\text{priv}}(X_1;X_3|X_2) = 0 . \tag{72}
\end{align}

In particular, Markov chains have no synergistic information.

Proof: For this case one can show that

$$ \min_{i \neq j} I(X_i;X_j) = I(X_1;X_3) = I(X_1;X_2;X_3) , \tag{73} $$

where the first equality is a consequence of the data processing inequality, and the second of the fact
that $I(X_1;X_3|X_2) = 0$. The above equality shows that the bounds for the shared information
presented in Lemma 3 give the unique solution $I_\cap(X_1;X_2;X_3) = I(X_1;X_3)$. All the other
equalities follow from this fact and their definition. ■

Using this corollary, the unique decomposition of the $\mathrm{DTC} = \Delta H_{(2)} + \Delta H_{(3)}$ for Markov
chains is given by

\begin{align}
\Delta H_{(2)} &= I(X_1;X_2) + I(X_2;X_3) - 2 I(X_1;X_3) , \tag{74} \\
\Delta H_{(3)} &= I(X_1;X_3) . \tag{75}
\end{align}
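
As a sanity check of (69)-(75), the layers of the decomposition can be computed for a concrete Markov chain and compared with the DTC obtained as $H(X_1,X_2,X_3) - H_{(1)}$. The sketch below assumes a binary chain built from two binary symmetric channels with arbitrary, illustrative parameters.

```python
import numpy as np

def H(p, axes):
    """Entropy in bits of the marginal of the joint pmf `p` over `axes`."""
    drop = tuple(a for a in range(p.ndim) if a not in axes)
    q = (p.sum(axis=drop) if drop else p).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def mi(p, a, b):
    """Mutual information I(X_a; X_b)."""
    return H(p, (a,)) + H(p, (b,)) - H(p, (a, b))

def cmi(p, a, b, c):
    """Conditional mutual information I(X_a; X_b | X_c)."""
    return H(p, (a, c)) + H(p, (b, c)) - H(p, (a, b, c)) - H(p, (c,))

def bsc(eps):
    """Transition matrix of a binary symmetric channel (illustrative)."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

# p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2), which has the structure of (68).
p = np.einsum('i,ij,jk->ijk', np.array([0.3, 0.7]), bsc(0.1), bsc(0.25))

shared = mi(p, 0, 2)                 # (69)
priv_12 = mi(p, 0, 1) - shared       # (70)
priv_23 = mi(p, 1, 2) - shared       # (71)
dH2 = priv_12 + priv_23              # (74)
dH3 = shared                         # (75), the synergy is zero by (72)

H_joint = H(p, (0, 1, 2))
H1 = sum(H_joint - H(p, rest) for rest in [(1, 2), (0, 2), (0, 1)])
dtc = H_joint - H1

print("I(X1;X3|X2) =", cmi(p, 0, 2, 1))
print("DTC         =", dtc)
print("dH2 + dH3   =", dH2 + dH3)
```

The conditional mutual information $I(X_1;X_3|X_2)$ should vanish up to rounding, and the two last printed values should coincide.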

Hence, Corollary 12 states that a sufficient condition for three pairwise marginals to be
compatible with zero information synergy is for them to satisfy the Markov condition
$p_{X_3|X_1} = \sum_{x_2} p_{X_3|X_2}\, p_{X_2|X_1}$. The question of finding a necessary condition is an open problem, intrinsically
linked with the problem of finding a good definition for the shared information for arbitrary PME
distributions.

To conclude, let us note an interesting duality that exists between Markov chains and the
case where two variables are pairwise independent, which is illustrated in Table III.

TABLE III

Duality between Markov chains and pairwise independent variables

Markov chains                                   Pairwise independent variables
----------------------------------------------  ----------------------------------------------
Conditional pairwise independency               Pairwise independency
$I(X_1;X_3|X_2) = 0$                            $I(X_1;X_2) = 0$
No $I_{\text{priv}}$ between $X_1$ and $X_3$    No $I_{\text{priv}}$ between $X_1$ and $X_2$
No synergistic information                      No shared information


VI. Entropy decomposition for the Gaussian case 

In this section we study the entropy decomposition for the case where $(X_1,X_2,X_3)$ follow
a multivariate Gaussian distribution. As the entropy is not affected by translation, we assume
without loss of generality that all the variables have zero mean. The covariance matrix is denoted
as

$$ \Sigma = \begin{pmatrix} \sigma_1^2 & \alpha\sigma_1\sigma_2 & \beta\sigma_1\sigma_3 \\ \alpha\sigma_1\sigma_2 & \sigma_2^2 & \gamma\sigma_2\sigma_3 \\ \beta\sigma_1\sigma_3 & \gamma\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix} , \tag{76} $$

where $\sigma_i^2$ is the variance of $X_i$, $\alpha$ is the correlation between $X_1$ and $X_2$, $\beta$ is the correlation
between $X_1$ and $X_3$ and $\gamma$ is the correlation between $X_2$ and $X_3$. The condition that the matrix
$\Sigma$ should be positive semi-definite yields the following condition:

$$ 1 + 2\alpha\beta\gamma - \alpha^2 - \beta^2 - \gamma^2 \geq 0 . \tag{77} $$


Unfortunately, Theorem 6 implicitly states that Axioms (1)-(5) do not define a unique
symmetrical information decomposition for Gaussian variables with an arbitrary covariance matrix.
Nevertheless, there are some interesting properties of their shared and synergistic information,
which are discussed in Sections VI-A and VI-B. Then, Section VI-C presents one symmetrical
information decomposition that is consistent with these properties.


A. Understanding the synergistic information between Gaussians 

The simple structure of the joint p.d.f. of multivariate Gaussians, which is fully determined
by mere second order statistics, could lead one to think that these systems do not have
synergistic information sharing. However, it can be shown that a multivariate Gaussian is the
maximum entropy distribution for a given covariance matrix $\Sigma$. Hence, the discussion provided
in Section V-A suggests that these distributions can indeed have non-zero information synergy,
depending on the structure of the pairwise distributions, or equivalently, on the properties of $\Sigma$.

Moreover, it has been reported that synergistic phenomena are rather common among
multivariate Gaussian variables [39]. As a simple example, consider

$$ X_1 = A + B, \qquad X_2 = B, \qquad X_3 = A, \tag{78} $$

where $A$ and $B$ are independent Gaussians. Intuitively, it can be seen that although $X_2$ is useless
by itself for predicting $X_3$, it can be used jointly with $X_1$ to remove the noise term $B$ and provide
a perfect prediction. To refine this observation, let us consider a more general example where
the variables have equal variances and $X_2$ and $X_3$ are independent (i.e. $\gamma = 0$). Then, the optimal
predictor of $X_3$ given $X_1$ is $\hat{X}_3^{X_1} = \beta X_1$, the optimal predictor given $X_2$ is $\hat{X}_3^{X_2} = 0$, and the
optimal predictor given both $X_1$ and $X_2$ is [50]

$$ \hat{X}_3^{X_1,X_2} = \frac{\beta}{1 - \alpha^2} (X_1 - \alpha X_2) . \tag{79} $$

Therefore, although $X_2$ is useless to predict $X_3$ by itself, it can be used for further improving
the prediction given by $X_1$. Hence, all the information provided by $X_2$ is synergistic, as it is useful
only when combined with the information provided by $X_1$. Note that all these examples fall in
the category of the systems considered in Section IV.
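
The predictor in (79) is the standard linear minimum mean-square error estimator, so it can be cross-checked by solving the normal equations. The sketch below assumes unit variances, $\gamma = 0$, and arbitrary illustrative values for $\alpha$ and $\beta$.

```python
import numpy as np

# Assumed correlations (illustrative): alpha = corr(X1, X2), beta = corr(X1, X3), gamma = 0.
alpha, beta = 0.6, 0.5
Sigma = np.array([[1.0, alpha, beta],
                  [alpha, 1.0, 0.0],
                  [beta, 0.0, 1.0]])

# Coefficients of the best linear predictor of X3 from (X1, X2):
# solve Sigma_{12,12} w = Sigma_{12,3}.
w = np.linalg.solve(Sigma[:2, :2], Sigma[:2, 2])
w_formula = beta / (1 - alpha**2) * np.array([1.0, -alpha])   # (79)

# Mean-square prediction errors for the three predictors.
mse_x1 = 1 - beta**2                 # using X1 alone
mse_x2 = 1.0                         # using X2 alone (X2 is independent of X3)
mse_both = 1 - Sigma[:2, 2] @ w      # using X1 and X2 jointly

print("weights from normal equations:", w)
print("weights from (79)            :", w_formula)
print("MSE with X1, X2, both        :", mse_x1, mse_x2, mse_both)
```

The joint predictor has a strictly smaller error than the one based on $X_1$ alone whenever $\alpha\beta \neq 0$, even though $X_2$ alone does not reduce the error at all.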

B. Understanding the shared information 

Let us start by studying the information shared between two Gaussians. For this, let us consider
a pair of zero-mean variables $(X_1,X_2)$ with unit variance and correlation $\alpha$. A suggestive way
of expressing these variables is given by

$$ X_1 = W_1 \pm W_{12}, \qquad X_2 = W_2 \pm W_{12} , \tag{80} $$

where $W_1$, $W_2$ and $W_{12}$ are independent centered Gaussian variables with variances
$s_1^2 = s_2^2 = 1 - |\alpha|$ and $s_{12}^2 = |\alpha|$, respectively. Note that the signs in (80) can be set in order to achieve any
desired sign for the covariance (as $E\{X_1X_2\} = \pm E\{W_{12}^2\} = \pm s_{12}^2$). The mutual information is
given by (see Appendix D)

$$ I(X_1;X_2) = -\frac{1}{2} \log(1 - \alpha^2) = -\frac{1}{2} \log(1 - s_{12}^4) , \tag{81} $$

showing that it is directly related to the variance of the common term $W_{12}$.

For studying the shared information between three Gaussian variables, let us start by considering
a case where $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 1$, $\alpha = \beta := \rho$ and $\gamma = 0$. It can be seen that (c.f. Appendix D)

$$ I(X_1;X_2;X_3) = \frac{1}{2} \log \frac{1 - 2\rho^2}{(1 - \rho^2)^2} . \tag{82} $$

A direct evaluation shows that (82) is non-positive* for all $\rho$ with $|\rho| < 1/\sqrt{2}$ (note that $|\rho|$ cannot
be larger than $1/\sqrt{2}$ because of condition (77)). Therefore, following the discussion related to
(21), this system has no shared information for all $\rho$ and has zero synergistic information only
for $\rho = 0$. In contrast, let us now consider a case where $\alpha = \beta = \gamma := \rho > 0$, for which

$$ I(X_1;X_2;X_3) = \frac{1}{2} \log \frac{1 + 2\rho^3 - 3\rho^2}{(1 - \rho^2)^3} . \tag{83} $$

A direct evaluation shows that, in contrast to (82), the co-information in this case is non-negative,
showing that the system is dominated by shared information for all $\rho \neq 0$.

*This is consistent with the fact that $X_2$ and $X_3$ are pairwise independent, and hence due to (40) one has that
$0 \leq I_S(X_1;X_2;X_3) = -I(X_1;X_2;X_3)$.

The previous discussion suggests that the shared information depends on the smallest of the
correlation coefficients. An interesting approach to understand this fact can be found in [39],
where the predictability among Gaussians is discussed. In this work, the authors note that from
the point of view of $X_3$ both $X_1$ and $X_2$ are able to decompose the target in a predictable
and an unpredictable portion: $X_3 = \hat{X}_3 + E$. In this sense, both predictors achieve the same
effect although with a different efficiency, which is determined by their correlation coefficient.
As a consequence of this, the predictor that is less correlated with the target does not provide
unique predictability and hence its contribution is entirely redundant. This motivates the following
redundant predictability measure:

$$ \mathcal{R}(X_1X_2 \to X_3) := \min\{I(X_1;X_3), I(X_2;X_3)\} . \tag{84} $$

C. Shared, private and synergistic information for Gaussian variables 

Let us use the intuitions developed in the previous section for building a symmetrical
information decomposition. For this, we use the decomposition given by the following lemma (whose
proof is presented in Appendix E).

Lemma 13: Let $(X_1,X_2,X_3)$ follow a multivariate Gaussian distribution with zero mean and
covariance matrix $\Sigma$ with $\alpha \geq \beta \geq \gamma \geq 0$. Then

\begin{align}
\frac{X_1}{\sigma_1} &= s_{123} W_{123} + s_{12} W_{12} + s_{13} W_{13} + s_1 W_1 , \tag{85} \\
\frac{X_2}{\sigma_2} &= s_{123} W_{123} + s_{12} W_{12} + s_2 W_2 , \tag{86} \\
\frac{X_3}{\sigma_3} &= s_{123} W_{123} + s_{13} W_{13} + s_3 W_3 , \tag{87}
\end{align}

where $W_{123}$, $W_{12}$, $W_{13}$, $W_1$, $W_2$ and $W_3$ are independent standard Gaussians and $s_{123}$, $s_{12}$, $s_{13}$, $s_1$, $s_2$
and $s_3$ are given by

\begin{align}
s_{123} &= \sqrt{\gamma}, & s_{12} &= \sqrt{\alpha - \gamma}, & s_{13} &= \sqrt{\beta - \gamma}, \\
s_1 &= \sqrt{1 - \alpha - \beta + \gamma}, & s_2 &= \sqrt{1 - \alpha}, & s_3 &= \sqrt{1 - \beta} . \tag{88}
\end{align}
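
One way to see that (85)-(88) reproduce the intended correlations is to write the decomposition as a loading matrix and multiply it by its transpose. The sketch below assumes unit variances and illustrative correlation values for which all the square roots in (88) are real.

```python
import numpy as np

# Assumed correlations with alpha >= beta >= gamma >= 0 and 1 - alpha - beta + gamma >= 0.
alpha, beta, gamma = 0.5, 0.3, 0.2

s123 = np.sqrt(gamma)
s12 = np.sqrt(alpha - gamma)
s13 = np.sqrt(beta - gamma)
s1 = np.sqrt(1 - alpha - beta + gamma)
s2 = np.sqrt(1 - alpha)
s3 = np.sqrt(1 - beta)

# Rows: X1, X2, X3 (normalized); columns: W123, W12, W13, W1, W2, W3.
A = np.array([
    [s123, s12, s13, s1, 0.0, 0.0],   # (85)
    [s123, s12, 0.0, 0.0, s2, 0.0],   # (86)
    [s123, 0.0, s13, 0.0, 0.0, s3],   # (87)
])

# Since the W's are independent standard Gaussians, Cov(X) = A A^T.
corr = A @ A.T
print(corr)
```

The product should equal the correlation matrix with off-diagonal entries $\alpha$, $\beta$ and $\gamma$ and unit diagonal.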


It is natural to relate $s_{123}$ with the shared information, $s_{12}$ and $s_{13}$ with the private information
and $s_1$, $s_2$ and $s_3$ with the exclusive terms. Note that the decomposition presented in Lemma 13
is unique in not requiring a private component between the two least correlated variables, i.e. a
term $W_{23}$. Hence, based on Lemma 13 and (81), we propose the following symmetric information
decomposition for Gaussians:

\begin{align}
I_\cap(X_1;X_2;X_3) &= -\frac{1}{2} \log(1 - \min\{\alpha^2, \beta^2, \gamma^2\}) , \tag{89} \\
I_{\text{priv}}(X_1;X_2|X_3) &= I(X_1;X_2) - I_\cap(X_1;X_2;X_3) \tag{90} \\
&= \frac{1}{2} \log \frac{1 - \min\{\alpha^2, \beta^2, \gamma^2\}}{1 - \alpha^2} , \tag{91} \\
I_S(X_1;X_2;X_3) &= I(X_1;X_2|X_3) - I_{\text{priv}}(X_1;X_2|X_3) \tag{92} \\
&= \frac{1}{2} \log \frac{(1 - \alpha^2)(1 - \beta^2)(1 - \gamma^2)}{(1 + 2\alpha\beta\gamma - \alpha^2 - \beta^2 - \gamma^2)(1 - \min\{\alpha^2, \beta^2, \gamma^2\})} . \tag{93}
\end{align}

First, note that the above shared information coincides with what was expected from Lemma 13,
as for the general case $s_{123}^2 = \min\{|\alpha|, |\beta|, |\gamma|\}$. Also, (91) is consistent with the fact that the two
least correlated Gaussians share no private information. Moreover, by comparing (93) and (122),
it can be seen that if $X_1$ and $X_2$ are the least correlated variables then the synergistic information
can be expressed as $I_S(X_1;X_2;X_3) = I(X_1;X_2|X_3)$, which for the particular case of $\alpha = 0$
confirms (40). This in turn also shows that, for the particular case of Gaussian variables, forming
a Markov chain is a necessary and sufficient condition for having zero information synergy*.

Finally, by noting that (89) can also be expressed as

$$ I_\cap(X_1;X_2;X_3) = \min\{I(X_1;X_2), I(X_2;X_3), I(X_1;X_3)\} , \tag{94} $$

it can be seen that our definition of shared information corresponds to the canonical
symmetrization of (84) as discussed in Lemma 4. In contrast with (84), (94) states that there cannot
be information shared by the three components of the system if two of them are pairwise
independent. Therefore, the magnitude of the shared information is governed by the lowest
correlation coefficient of the whole system, being upper-bounded by any of the redundant
predictability terms.

*For the case of $\alpha \geq \beta \geq \gamma$, a direct calculation shows that $I_S(X_1;X_2;X_3) = 0$ is equivalent to $\gamma = \alpha\beta$.

To close this section, let us note that (94) corresponds to the upper bound provided by 
(15), which means that multivariate Gaussians have a maximal shared information. This is 
complementary to the fact that, because of being a maximum entropy distribution, they also 
have the smallest amount of synergy that is compatible with the corresponding second order 
statistics. 
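
The proposed Gaussian decomposition (89)-(93) is straightforward to evaluate. The following sketch assumes natural logarithms (so the results are in nats) and uses the function name `gaussian_decomposition` and the example correlations purely for illustration.

```python
import numpy as np

def gaussian_decomposition(alpha, beta, gamma):
    """Shared, private (between X1 and X2) and synergistic terms of (89)-(93), in nats.

    alpha = corr(X1, X2), beta = corr(X1, X3), gamma = corr(X2, X3).
    """
    det = 1 + 2 * alpha * beta * gamma - alpha**2 - beta**2 - gamma**2
    if det <= 0:
        raise ValueError("correlations violate the strict form of condition (77)")
    m = min(alpha**2, beta**2, gamma**2)
    shared = -0.5 * np.log(1 - m)                                  # (89)
    priv_12 = 0.5 * np.log((1 - m) / (1 - alpha**2))               # (91)
    synergy = 0.5 * np.log((1 - alpha**2) * (1 - beta**2) * (1 - gamma**2)
                           / (det * (1 - m)))                      # (93)
    return shared, priv_12, synergy

# Assumed example 1: Markov chain X1 - X2 - X3, for which beta = alpha * gamma.
shared_mc, priv_mc, syn_mc = gaussian_decomposition(0.8, 0.8 * 0.5, 0.5)

# Assumed example 2: X1 and X2 independent (alpha = 0), the setting of Section IV.
shared_pi, priv_pi, syn_pi = gaussian_decomposition(0.0, 0.6, 0.5)

print("Markov chain        :", shared_mc, priv_mc, syn_mc)
print("pairwise independent:", shared_pi, priv_pi, syn_pi)
```

In the first example the synergy vanishes up to rounding, and in the second the shared information vanishes while the synergy equals $I(X_1;X_2|X_3)$, in agreement with the discussion above.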


VII. Applications to Network Information Theory 

In this section we use the framework presented in Section III to analyze four fundamental
scenarios in network information theory [51]. Our goal is to illustrate how the framework can
be used to build new intuitions about these well-known optimal information-theoretic strategies.
The application of the framework to scenarios with open problems is left for future work.

In the following, Section VII-A uses the general framework to analyze Slepian-Wolf
coding for three sources, which is a fundamental result in the literature of distributed source
compression. Then, Section VII-B applies the results of Section IV to the multiple access channel,
which is one of the fundamental settings in multiuser information theory. Section VII-C applies
the results related to Markov chains from Section V to the wiretap channel, which constitutes
one of the main models of information-theoretic secrecy. Finally, Section VII-D uses results
from Section VI to study fundamental limits of public or private broadcast transmissions over
Gaussian channels.

A. Slepian-Wolf coding 

Slepian-Wolf coding gives lower bounds for the data rates that are required to transfer
the information contained in various data sources. Let us denote as $R_k$ the data rate of the $k$-th
source and define $\tilde{R}_k = R_k - H(X_k|X_k^c)$, where $X_k^c$ denotes all the sources other than $X_k$, as the
extra data rate that each source has above its own exclusive information (c.f. Section II-B). Then, in
the case of two sources $X_1$ and $X_2$, the well-known Slepian-Wolf bounds can be re-written as
$\tilde{R}_1 \geq 0$, $\tilde{R}_2 \geq 0$, and $\tilde{R}_1 + \tilde{R}_2 \geq I(X_1;X_2)$ [51, Section 10.3]. The last inequality states that
$I(X_1;X_2)$ corresponds to shared information that can be transmitted by any of the two sources.

Let us consider now the case of three sources, and denote $R_S = I_S(X_1;X_2;X_3)$. The Slepian-Wolf
bounds provide seven inequalities [51, Section 10.5], which can be re-written as

\begin{align}
\tilde{R}_i &\geq 0, \quad i \in \{1,2,3\} , \tag{95} \\
\tilde{R}_i + \tilde{R}_j &\geq I_{\text{priv}}(X_i;X_j|X_k) + R_S \quad \text{for } i,j,k \in \{1,2,3\},\ i < j , \tag{96} \\
\tilde{R}_1 + \tilde{R}_2 + \tilde{R}_3 &\geq \Delta H_{(2)} + \Delta H_{(3)} . \tag{97}
\end{align}

Above, (97) states that the DTC needs to be accounted for by the extra rate of the sources, and (96)
that every pair needs to take care of its private information. Interestingly, due to (32) the
shared information needs to be included in only one of the rates, while the synergistic information
needs to be included in at least two. For example, one possible solution that is consistent with
these bounds is $\tilde{R}_1 = I_\cap(X_1;X_2;X_3) + I_{\text{priv}}(X_1;X_2|X_3) + I_{\text{priv}}(X_1;X_3|X_2) + I_S(X_1;X_2;X_3)$,
$\tilde{R}_2 = I_{\text{priv}}(X_2;X_3|X_1) + I_S(X_1;X_2;X_3)$ and $\tilde{R}_3 = 0$.

B. Multiple Access Channel 

Let us consider a multiple access channel, where two pairwise independent transmitters send
$X_1$ and $X_2$ and a receiver gets $X_3$, as shown in Fig. 3. It is well-known that, for a given distribution
$(X_1,X_2) \sim p(x_1)p(x_2)$, the achievable transmission rates $R_1$ and $R_2$ satisfy the constraints [51,
Section 4.5]

$$ R_1 \leq I(X_1;X_3|X_2), \quad R_2 \leq I(X_2;X_3|X_1), \quad R_1 + R_2 \leq I(X_1,X_2;X_3) . \tag{98} $$

As the transmitted random variables are pairwise independent, one can apply the results of
Section IV. Therefore, there is no shared information and $I_S(X_1;X_2;X_3) = I(X_1;X_3|X_2) - I(X_1;X_3)$.
Let us introduce a shorthand notation for the remaining terms: $C_1 = I_{\text{priv}}(X_1;X_3|X_2) = I(X_1;X_3)$,
$C_2 = I_{\text{priv}}(X_2;X_3|X_1) = I(X_2;X_3)$ and $C_S = I_S(X_1;X_2;X_3)$. Then, one can re-write
the bounds for the transmission rates as

$$ R_1 \leq C_1 + C_S, \quad R_2 \leq C_2 + C_S, \quad R_1 + R_2 \leq C_1 + C_2 + C_S . \tag{99} $$

From this, it is clear that while each transmitter has a private portion of the channel with capacity
$C_1$ or $C_2$, their interaction synergistically creates extra capacity $C_S$ that corresponds to what can
actually be shared.

[Figure 3: diagram of the two transmitters ($X_1$, $X_2$), the channel $p_{X_3|X_1,X_2}$ and the receiver ($X_3$), with the
capacity region annotated by $C_S = I_S(X_1;X_2;X_3)$, $C_1 = I_{\text{priv}}(X_1;X_3|X_2) = I(X_1;X_3)$ and
$C_2 = I_{\text{priv}}(X_2;X_3|X_1) = I(X_2;X_3)$.]

Fig. 3. Capacity region of the Multiple Access Channel, which represents the possible data-rates that two transmitters can use 
for transferring information to one receiver. 
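
To make the roles of $C_1$, $C_2$ and $C_S$ concrete, they can be computed for a simple channel. The sketch below assumes a noiseless binary adder channel, $X_3 = X_1 + X_2$ with integer addition, and independent uniform input bits; both the channel and the input distribution are illustrative choices rather than part of the analysis above.

```python
import numpy as np

def H(p, axes):
    """Entropy in bits of the marginal of the joint pmf `p` over `axes`."""
    drop = tuple(a for a in range(p.ndim) if a not in axes)
    q = (p.sum(axis=drop) if drop else p).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def mi(p, a, b):
    """Mutual information I(X_a; X_b)."""
    return H(p, (a,)) + H(p, (b,)) - H(p, (a, b))

def cmi(p, a, b, c):
    """Conditional mutual information I(X_a; X_b | X_c)."""
    return H(p, (a, c)) + H(p, (b, c)) - H(p, (a, b, c)) - H(p, (c,))

# Joint pmf of (X1, X2, X3) for the assumed adder channel with X3 in {0, 1, 2}.
p = np.zeros((2, 2, 3))
for x1 in range(2):
    for x2 in range(2):
        p[x1, x2, x1 + x2] = 0.25

C1 = mi(p, 0, 2)                       # I_priv(X1;X3|X2) = I(X1;X3)
C2 = mi(p, 1, 2)                       # I_priv(X2;X3|X1) = I(X2;X3)
CS = cmi(p, 0, 2, 1) - mi(p, 0, 2)     # I_S = I(X1;X3|X2) - I(X1;X3)

# Sum-rate bound of (98): I(X1,X2;X3) = H(X1,X2) + H(X3) - H(X1,X2,X3).
sum_rate = H(p, (0, 1)) + H(p, (2,)) - H(p, (0, 1, 2))

print("C1, C2, CS  :", C1, C2, CS)
print("C1 + C2 + CS:", C1 + C2 + CS, " I(X1,X2;X3):", sum_rate)
```

For this channel each of the three terms equals half a bit, so one third of the sum rate of 1.5 bits is synergistic capacity that either user can claim but that cannot be counted twice.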

C. Degraded Wiretap Channel 

Consider a communication system with an eavesdropper (shown in Fig. 4), where the
transmitter sends $X_1$, the intended receiver gets $X_2$ and the eavesdropper receives $X_3$. For simplicity
of exposition, let us consider the case where the eavesdropper gets only a degraded copy of
the signal received by the intended receiver, i.e. that $X_1 - X_2 - X_3$ form a Markov chain. Using
the results of Section V-B, one can see that in this case there is no synergistic information but only
shared and private information between $X_1$, $X_2$ and $X_3$.



Fig. 4. The rate of secure information transfer, $C_{\text{sec}}$, is the portion of the mutual information that can be used while providing
perfect confidentiality with respect to the eavesdropper.


In this scenario, it is known that for a given input distribution $p_{X_1}$ the rate of secure
communication that can be achieved is upper bounded by [42, Section 3.4]

$$ C_{\text{sec}} = I(X_1;X_2) - I(X_1;X_3) = I_{\text{priv}}(X_1;X_2|X_3) , \tag{100} $$

which is precisely the private information sharing between $X_1$ and $X_2$. Also, as intuition
would suggest, the eavesdropping capacity is equal to the shared information between the three
variables:

$$ C_{\text{eav}} = I(X_1;X_2) - C_{\text{sec}} = I(X_1;X_3) = I_\cap(X_1;X_2;X_3) . \tag{101} $$
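
A small numerical illustration of (100) and (101) is given below. It assumes a uniform binary input, a binary symmetric main channel and an eavesdropper who observes the receiver's signal through a second binary symmetric channel; the function names and crossover probabilities are hypothetical, and no optimization over the input distribution is attempted.

```python
import math

def h2(x):
    """Binary entropy function in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def wiretap_rates(e_main, e_extra):
    """Return (C_sec, C_eav) of (100)-(101) for a uniform input bit.

    e_main : crossover probability of the channel X1 -> X2
    e_extra: crossover probability of the degrading channel X2 -> X3
    """
    # Effective crossover probability of the cascade X1 -> X3.
    e_eav = e_main * (1 - e_extra) + (1 - e_main) * e_extra
    i12 = 1 - h2(e_main)     # I(X1;X2)
    i13 = 1 - h2(e_eav)      # I(X1;X3)
    return i12 - i13, i13

c_sec, c_eav = wiretap_rates(0.1, 0.2)
print("C_sec =", c_sec, " C_eav =", c_eav)
```

The two rates add up to $I(X_1;X_2)$, and the secure rate grows as the eavesdropper's observation becomes more degraded, reaching $I(X_1;X_2)$ when the second crossover probability is 1/2.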


D. Gaussian Broadcast Channel 

Let us consider a Gaussian Broadcast Channel, where a transmitter sends a Gaussian signal 
$X_1$ that is received as $X_2$ and $X_3$ by two receivers. Assuming that all these variables are jointly 
Gaussian with zero mean and covariance matrix as given by (76), the transmitter can broadcast 
a public message, intended for both users, at a maximum rate $C_{\mathrm{pub}}$ given by [42, Section 5.1] 

$$C_{\mathrm{pub}} = \min\{I(X_1;X_2), I(X_1;X_3)\} = \mathcal{R}(X_2X_3 \to X_1), \tag{102}$$


where the redundant predictability, $\mathcal{R}(X_2X_3 \to X_1)$, between Gaussian variables is as defined 
in (84). On the other hand, if the transmitter wants to send a private (confidential) message to 
receiver 1, the corresponding maximum rate $C_{\mathrm{priv}}$ that can be achieved in this case is given by 



where the last equality follows from Axiom (2). 
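
The public rate in (102) can be evaluated with a few lines of code. The sketch below assumes that `alpha` and `beta` denote the correlation coefficients of the pairs $(X_1, X_2)$ and $(X_1, X_3)$, respectively; this parametrization and the function names are assumptions of the example, and it relies only on the standard expression for the mutual information between two jointly Gaussian variables.

```python
import math


def gaussian_mi(rho):
    """I(X;Y) in nats for jointly Gaussian X, Y with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)


def public_rate(alpha, beta):
    """Eq. (102): the smaller of I(X1;X2) and I(X1;X3).

    alpha and beta are assumed to be corr(X1, X2) and corr(X1, X3).
    """
    return min(gaussian_mi(alpha), gaussian_mi(beta))


# Hypothetical correlations: the weaker link limits the public rate.
c_pub = public_rate(alpha=0.8, beta=0.5)
print(f"C_pub = {c_pub:.4f} nats")
```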

Interestingly, the predictability measures prove to be better suited to describe the communication 
limits in the above scenario than their symmetrical counterparts. In effect, using the 
shared information would have underestimated the public capacity (cf. Section VI-C). This 
raises the question of whether directed measures could be better suited for studying certain 
communication systems than their symmetrized counterparts. Even though a definite 
answer to this question might not be straightforward, we hope that future research will provide 
more evidence and a better understanding of this issue. 
VIII. Conclusions 

In this work we propose an axiomatic framework for studying the interdependencies that 
can exist between multiple random variables as different modes of information sharing. The 
framework is based on a symmetric notion of information that refers to properties of the system as 
a whole. We showed that, in contrast to predictability-based decompositions, all the information 
terms of the proposed decomposition have unique expressions for Markov chains and for the 
case where two variables are pairwise independent. We also analyzed the cases of pairwise 
maximum entropy (PME) distributions and multivariate Gaussian variables. Finally, we illustrated 
the application of the framework by using it to develop a more intuitive understanding of the 
optimal information-theoretic strategies in several fundamental communication scenarios. 

The key insight that this framework provides is that although there is only one way in which 
information can be shared between two random variables, there are two essentially different ways 
of sharing between three. One of these ways is a simple extension of the pairwise dependency, 
where information is shared redundantly and hence any of the variables can be used to predict 
any other. The second way leads to the counter-intuitive notion of synergistic information sharing, 
where the information is shared in a way that the statistical dependency is destroyed if any of 
the variables is removed; hence, the structure exists in the whole but not in any of the parts. 
Information synergy has therefore been commonly related to statistical structures that exist only 
in the joint p.d.f. and not in low-order marginals. Interestingly, although we showed that indeed 
PME distributions possess the minimal information synergy that is allowed by their pairwise 
marginals, this minimum can be strictly positive. 

Therefore, there exists a connection between pairwise marginals and synergistic information 
sharing that is still to be further clarified. In fact, this phenomenon is related to the difference 
between the TC and the DTC, which is rooted in the fact that the information sharing modes 
and the marginal structure of the p.d.f. are, although somewhat related, intrinsically different. 
This important distinction has been represented in our framework by the sequence of internal 
and external entropies. This new unifying picture for the entropy, negentropy, TC and DTC has 
shed new light on the understanding of high-order interdependencies, whose consequences have 
only begun to be explored. 
Appendix A 
Proof of Lemma 3 

Proof: Let us assume that $\mathcal{R}(X_1X_2 \to Y)$ and $U(X_1 \to Y|X_2) = I(X_1;Y) - \mathcal{R}(X_1X_2 \to Y)$ 
satisfy Axioms (1)-(3). Then, 

$$\begin{aligned}
I(X_1;Y) &\geq I(X_1;Y) - U(X_1 \to Y|X_2) && (104)\\
&= \mathcal{R}(X_1X_2 \to Y) && (105)\\
&= I(X_2;Y) - U(X_2 \to Y|X_1) \leq I(X_2;Y), && (106)
\end{aligned}$$

where the inequalities are a consequence of the non-negativity of $U(X_1 \to Y|X_2)$ and the third 
equality is due to the weak symmetry of the redundant predictability. For proving the lower 
bound, first notice that Axiom (2) can be re-written as 

$$I(X_1X_2;Y) \geq I(X_1;Y) + I(X_2;Y) - \mathcal{R}(X_1X_2 \to Y). \tag{107}$$

The lower bound follows considering the non-negativity of $\mathcal{R}(X_1X_2 \to Y)$ and by noting that 
$I(X_1;Y) + I(X_2;Y) - I(X_1X_2;Y) = I(X_1;X_2;Y)$. 

The proof of the converse is direct, and left as an exercise to the reader. ■ 

Appendix B 

Proof of the consistency of Axiom (3) 

Let us show that $\min\{I(X_1;X_2), I(X_1;X_2)\} \geq I(X_1;X_2;X_3)$, showing that the bounds 
defined by Axiom (3) can always be satisfied. For this, let us assume that the variables are 
ordered in a way such that $I(X_1;X_2) = \min\{I(X_1;X_2), I(X_2;X_3), I(X_3;X_1)\}$ holds. Then, as 
one can express $I(X_1;X_2;X_3) = I(X_1;X_2) - I(X_1;X_2|X_3)$, it is direct to show that 

$$\begin{aligned}
\min\{I(X_1;X_2), I(X_1;X_2)\} - I(X_1;X_2;X_3) &\geq I(X_1;X_2) - I(X_1;X_2;X_3) && (108)\\
&= I(X_1;X_2|X_3) && (109)\\
&\geq 0, && (110)
\end{aligned}$$

from where the desired result follows. 
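
The inequality above can also be checked numerically. The following minimal sketch draws random joint distributions of three binary variables and verifies that the smallest pairwise mutual information is never below the co-information $I(X_1;X_2;X_3) = I(X_1;X_2) - I(X_1;X_2|X_3)$. The helper names, the alphabet size and the number of trials are choices made for this illustration only.

```python
import itertools
import math
import random


def entropy(pmf):
    """Shannon entropy (bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginal(pmf, axes):
    """Marginal distribution of the coordinates listed in `axes`."""
    out = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def mutual_info(pmf, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return (entropy(marginal(pmf, (a,))) + entropy(marginal(pmf, (b,)))
            - entropy(marginal(pmf, (a, b))))


def cond_mutual_info(pmf, a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(marginal(pmf, (a, c))) + entropy(marginal(pmf, (b, c)))
            - entropy(marginal(pmf, (a, b, c))) - entropy(marginal(pmf, (c,))))


def random_pmf(rng, alphabet=2):
    """Random joint pmf of three variables (normalised uniform weights)."""
    outcomes = list(itertools.product(range(alphabet), repeat=3))
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}


rng = random.Random(0)
worst_gap = float("inf")
for _ in range(1000):
    pmf = random_pmf(rng)
    pair_min = min(mutual_info(pmf, 0, 1),
                   mutual_info(pmf, 1, 2),
                   mutual_info(pmf, 0, 2))
    co_info = mutual_info(pmf, 0, 1) - cond_mutual_info(pmf, 0, 1, 2)
    worst_gap = min(worst_gap, pair_min - co_info)

# The gap should be non-negative up to floating-point error.
print(f"smallest observed gap: {worst_gap:.6f}")
```

A random search of this kind is only a sanity check, not a substitute for the argument in (108)-(110).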
Appendix C 
Proof of Lemma 3 

Proof: The symmetry of $I_{\cap}(X_1;X_2;X_3)$ can be directly verified from its definition. The 
weak symmetry of $I_{\mathrm{priv}}(X_1;X_3|X_2)$ can be shown as follows: 

$$\begin{aligned}
I_{\mathrm{priv}}(X_3;X_1|X_2) &= I(X_3;X_1) - I_{\cap}(X_3;X_2;X_1) && (111)\\
&= I(X_1;X_3) - I_{\cap}(X_1;X_2;X_3) && (112)\\
&= I_{\mathrm{priv}}(X_1;X_3|X_2). && (113)
\end{aligned}$$

The symmetry of $I_S(X_1;X_2;X_3)$ with respect to $X_1$ and $X_3$ follows directly from its definition, 
the weak symmetry of $I(X_1;X_3|X_2)$ and the strong symmetry of $I_{\cap}(X_1;X_2;X_3)$. The symmetry 
with respect to $X_1$ and $X_2$ can be shown using the definition of $I_S(X_1;X_2;X_3)$ and the strong 
symmetry of $I_{\cap}(X_1;X_2;X_3)$ and the co-information $I(X_1;X_2;X_3)$ as follows: 

$$\begin{aligned}
I_S(X_2;X_1;X_3) &= I(X_2;X_3|X_1) - [I(X_2;X_3) - I_{\cap}(X_2;X_1;X_3)] && (114)\\
&= -I(X_1;X_2;X_3) + I_{\cap}(X_2;X_1;X_3) && (115)\\
&= I(X_1;X_3|X_2) - I(X_1;X_3) + I_{\cap}(X_1;X_2;X_3) && (116)\\
&= I_S(X_1;X_2;X_3). && (117)
\end{aligned}$$

The bounds for $I_{\cap}(X_1;X_2;X_3)$, $I_{\mathrm{priv}}(X_1;X_2|X_3)$ and $I_S(X_1;X_2;X_3)$ follow directly from the 
definition of these quantities and Axiom (3). Finally, d) is proven directly using those definitions, 
and the fact that the mutual information depends only on the pairwise marginals, while the 
conditional mutual information depends on the full p.d.f. ■ 


Appendix D 

Useful facts about Gaussians 


Here we list some useful expressions for Gaussian variables: 


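$$I(X_1;X_2) = \frac{1}{2}\log\frac{1}{1-\alpha^2},$$

$$I(X_1;X_2|X_3) = \frac{1}{2}\log\frac{|\Sigma_{13}||\Sigma_{23}|}{\sigma_3^2|\Sigma|} = \frac{1}{2}\log\frac{(1-\beta^2)(1-\gamma^2)}{1+2\alpha\beta\gamma-\alpha^2-\beta^2-\gamma^2},$$

$$I(X_1;X_2;X_3) = \frac{1}{2}\log\frac{\sigma_1^2\sigma_2^2\sigma_3^2|\Sigma|}{|\Sigma_{12}||\Sigma_{13}||\Sigma_{23}|} = \frac{1}{2}\log\frac{1+2\alpha\beta\gamma-\alpha^2-\beta^2-\gamma^2}{(1-\alpha^2)(1-\beta^2)(1-\gamma^2)},$$

where $|A|$ is a matrix determinant, and 

$$\Sigma_{12} = \begin{pmatrix} \sigma_1^2 & \alpha\sigma_1\sigma_2 \\ \alpha\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \quad
\Sigma_{13} = \begin{pmatrix} \sigma_1^2 & \beta\sigma_1\sigma_3 \\ \beta\sigma_1\sigma_3 & \sigma_3^2 \end{pmatrix}, \quad
\Sigma_{23} = \begin{pmatrix} \sigma_2^2 & \gamma\sigma_2\sigma_3 \\ \gamma\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}.$$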
Appendix E 
Proof of Lemma 13 

Proof: Consider the following random variables 


$$Y_1 = \sigma_1 (s_{123} W_{123} + s_{12} W_{12} + s_{13} W_{13} + s_1 W_1), \tag{127}$$

$$Y_2 = \sigma_2 (s_{123} W_{123} + s_{12} W_{12} + s_2 W_2), \tag{128}$$

$$Y_3 = \sigma_3 (s_{123} W_{123} + s_{13} W_{13} + s_3 W_3), \tag{129}$$

where $W_{123}$, $W_{12}$, $W_{13}$, $W_1$, $W_2$ and $W_3$ are independent standard Gaussians and the parameters 
$s_{123}$, $s_{12}$, $s_{13}$, $s_1$, $s_2$ and $s_3$ are as defined in (88). Then, it is direct to check that $Y = (Y_1, Y_2, Y_3)$ is a 
multivariate Gaussian variable with zero mean and covariance matrix $\Sigma_Y$ equal to (76). Therefore, 
$(Y_1, Y_2, Y_3)$ and $(X_1, X_2, X_3)$ have the same statistics, which proves the desired result. ■ 
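
The covariance check mentioned in the proof can be sketched in code. Since each $Y_i$ in (127)-(129) is a linear combination of independent standard Gaussians, the covariance between two of them is the inner product of their coefficient vectors. The numerical values of the $s$ parameters and of $\sigma_1, \sigma_2, \sigma_3$ below are hypothetical placeholders chosen for illustration; they are not the expressions of (88), which are not reproduced here.

```python
import math


def mixing_rows(sigma, s):
    """Coefficients of (W123, W12, W13, W1, W2, W3) in Y1, Y2, Y3,
    following the structure of (127)-(129)."""
    sg1, sg2, sg3 = sigma
    return [
        [sg1 * s["123"], sg1 * s["12"], sg1 * s["13"], sg1 * s["1"], 0.0, 0.0],
        [sg2 * s["123"], sg2 * s["12"], 0.0, 0.0, sg2 * s["2"], 0.0],
        [sg3 * s["123"], 0.0, sg3 * s["13"], 0.0, 0.0, sg3 * s["3"]],
    ]


def covariance(rows):
    """Covariance matrix of linear mixes of independent standard Gaussians:
    entry (i, j) is the inner product of coefficient rows i and j."""
    return [[sum(a * b for a, b in zip(r, q)) for q in rows] for r in rows]


# Hypothetical parameters (NOT taken from (88)); the squared coefficients
# in each Y_i sum to one, so that Var(Y_i) = sigma_i ** 2.
s = {"123": math.sqrt(0.2), "12": math.sqrt(0.3), "13": math.sqrt(0.1),
     "1": math.sqrt(0.4), "2": math.sqrt(0.5), "3": math.sqrt(0.7)}
sigma = (1.0, 2.0, 0.5)

cov = covariance(mixing_rows(sigma, s))
for row in cov:
    print(["%.3f" % value for value in row])
```

With these placeholder values the off-diagonal entries divided by the corresponding $\sigma_i\sigma_j$ give the implied correlation coefficients, which is how one would compare the result against a target covariance matrix.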